The goal of this project is to use the tools we have learned in class to predict a categorical variable and a numeric one using classification and regression techniques. Each prediction is divided into two parts: interpretation, where the main goal is to understand the data and how the variables relate to each other (using statistical tools); and prediction, where we try to maximize the accuracy of our model by minimizing its cost (using more complex and flexible methods such as machine learning).
First and foremost, the costs of the errors in this data set have been invented, so they do not reflect reality. Moreover, some code (including a few models) has been commented out because it takes too long to compute; otherwise R could not render the HTML. The corresponding results have still been included. With this said, let’s start.
workingDirectory = ""
setwd(workingDirectory)
loadLibraries = function(libraries) {
  for (lib in libraries) {
    # Install any package that is not available yet, then attach it.
    if (!requireNamespace(lib, quietly = TRUE)) {
      install.packages(lib)
    }
    library(lib, character.only = TRUE)
  }
}
# All libraries that we are going to use.
libraries = c("VIM", "rlang", "ggplot2", "caret", "tidyverse", "MASS", "e1071",
"GGally", "tm", "SnowballC", "naivebayes", "skimr", "mice",
"glmnet", "rpart", "pROC", "class", "randomForest", "dplyr",
"factoextra", "foreach", "doParallel", "rpart.plot", "ggord", "fastDummies",
"ellipse")
loadLibraries(libraries)
rm(list = ls())
The data set that we are going to use in this project was taken from Kaggle (https://www.kaggle.com/mojtaba142/hotel-booking). The data is about two hotels (a city hotel and a resort hotel) and their customers; the goal of this project is to predict whether a customer will cancel the reservation or not given their booking information.
hotel.df = read.csv("hotel_booking.csv", stringsAsFactors = TRUE)
First of all, we need to delete these variables (personal customer data), as they were invented by the dataset’s creator and we do not really care much about them.
colsToRemove = c("name", "email", "phone.number", "credit_card")
hotel.df = hotel.df[, -which(names(hotel.df) %in% colsToRemove)]
rm(colsToRemove)
Columns description
Now, let’s take a look at the data and see what we are working with.
summary(hotel.df)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 Min. :0.0000 Min. : 0 Min. :2015
## Resort Hotel:40060 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## August :13877 Min. : 1.00 Min. : 1.0
## July :12661 1st Qu.:16.00 1st Qu.: 8.0
## May :11791 Median :28.00 Median :16.0
## October:11160 Mean :27.17 Mean :15.8
## April :11089 3rd Qu.:38.00 3rd Qu.:23.0
## June :10939 Max. :53.00 Max. :31.0
## (Other):47873
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB :92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB : 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB :14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC :10650 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 Undefined: 1169 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## NA's :4 (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Corporate: 6677 Min. :0.00000
## Offline TA/TO:24219 Direct :14645 1st Qu.:0.00000
## Groups :19811 GDS : 193 Median :0.00000
## Direct :12606 TA/TO :97870 Mean :0.03191
## Corporate : 5295 Undefined: 5 3rd Qu.:0.00000
## Complementary: 743 Max. :1.00000
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 Min. : 1.00
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 1st Qu.: 9.00
## E : 7806 Median : 0.0000 Refundable: 162 Median : 14.00
## F : 3751 Mean : 0.2211 Mean : 86.69
## G : 2553 3rd Qu.: 0.0000 3rd Qu.:229.00
## C : 2375 Max. :21.0000 Max. :535.00
## (Other): 3530 NA's :16340
## company days_in_waiting_list customer_type
## Min. : 6.0 Min. : 0.000 Contract : 4076
## 1st Qu.: 62.0 1st Qu.: 0.000 Group : 577
## Median :179.0 Median : 0.000 Transient :89613
## Mean :189.3 Mean : 2.321 Transient-Party:25124
## 3rd Qu.:270.0 3rd Qu.: 0.000
## Max. :543.0 Max. :391.000
## NA's :112593
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date
## Canceled :43017 2015-10-21: 1461
## Check-Out:75166 2015-07-06: 805
## No-Show : 1207 2016-11-25: 790
## 2015-01-01: 763
## 2016-01-18: 625
## 2015-07-02: 469
## (Other) :114477
aggr(hotel.df, numbers = TRUE, sortVars = TRUE, labels = names(hotel.df), cex.axis = .5, gap = 1)
##
## Variables sorted by number of missings:
## Variable Count
## company 9.430689e-01
## agent 1.368624e-01
## children 3.350364e-05
## hotel 0.000000e+00
## is_canceled 0.000000e+00
## lead_time 0.000000e+00
## arrival_date_year 0.000000e+00
## arrival_date_month 0.000000e+00
## arrival_date_week_number 0.000000e+00
## arrival_date_day_of_month 0.000000e+00
## stays_in_weekend_nights 0.000000e+00
## stays_in_week_nights 0.000000e+00
## adults 0.000000e+00
## babies 0.000000e+00
## meal 0.000000e+00
## country 0.000000e+00
## market_segment 0.000000e+00
## distribution_channel 0.000000e+00
## is_repeated_guest 0.000000e+00
## previous_cancellations 0.000000e+00
## previous_bookings_not_canceled 0.000000e+00
## reserved_room_type 0.000000e+00
## assigned_room_type 0.000000e+00
## booking_changes 0.000000e+00
## deposit_type 0.000000e+00
## days_in_waiting_list 0.000000e+00
## customer_type 0.000000e+00
## adr 0.000000e+00
## required_car_parking_spaces 0.000000e+00
## total_of_special_requests 0.000000e+00
## reservation_status 0.000000e+00
## reservation_status_date 0.000000e+00
We can see that there are some missing values, so let’s check whether they are concentrated in a few columns that we could delete, or spread across the rows.
length(which(!complete.cases(hotel.df)))
## [1] 119173
Note that almost every row has at least one missing value, so we cannot simply delete the incomplete customers; the missingness must be concentrated in a few columns instead.
hist(rowMeans(is.na(hotel.df)), xlab = c("Missing values average by rows"), main = c())
Here, we can see that the rows are not really that empty, so let’s look at the columns.
indexesEmptyCols = which(colMeans(is.na(hotel.df)) != 0)
colsWithNA = sort(colMeans(is.na(hotel.df[, indexesEmptyCols])),
decreasing = TRUE)
barplot(colsWithNA, las=2)
The columns company and agent are the ones that are most empty, so let’s remove them.
indexesColsToRemove = match(names(colsWithNA)[1:2], names(hotel.df))
hotel.df = hotel.df[, -indexesColsToRemove]
Here, we clear our working environment.
rm(list = setdiff(ls(), "hotel.df"))
Let’s take a look at how many missing values there are left.
length(which(!complete.cases(hotel.df)))
## [1] 4
There are only 4, so let’s remove them.
hotel.df = hotel.df[!is.na(hotel.df$children), ]  # children is the only column that still has NAs
Now, sometimes missing values hide inside factor variables as empty-string levels and are not clearly visible. So let’s take a look at the factor variables and see if there are any.
hotel.df = droplevels.data.frame(hotel.df)
colsWithEmptyFactors.df = data.frame(colName = character(),
index = integer(),
naNumber = integer(),
stringsAsFactors = FALSE)
for (i in 1:ncol(hotel.df)) {
levels = levels(hotel.df[, i])
if ("" %in% levels) {
colsWithEmptyFactors.df = rbind(colsWithEmptyFactors.df,
data.frame(colName = names(hotel.df)[i],
index = i, naNumber = 0,
stringsAsFactors = FALSE))
}
}
for (i in 1:nrow(colsWithEmptyFactors.df)) {
index = colsWithEmptyFactors.df$index[i]
level = levels(hotel.df[, index])
level[which(level == "")] = NA
levels(hotel.df[, index]) = level
colsWithEmptyFactors.df[i, 3] = length(which(is.na(hotel.df[, index]) == TRUE))
}
print(colsWithEmptyFactors.df)
## colName index naNumber
## 1 country 14 488
We see that they were hidden! But they are just 488 out of roughly 119,000 rows, so let’s just remove them.
indexesEmptyRows = which(rowMeans(is.na(hotel.df))!= 0)
hotel.df = hotel.df[-indexesEmptyRows,]
print(length(which(is.na(hotel.df) == TRUE)))
## [1] 0
After cleaning the data set, let’s start with some feature analysis and maybe engineer a new feature that will be helpful for our models.
rm(list = setdiff(ls(), "hotel.df"))
glimpse(hotel.df)
## Rows: 118,898
## Columns: 30
## $ hotel <fct> Resort Hotel, Resort Hotel, Resort Hote…
## $ is_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time <int> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year <int> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <fct> July, July, July, July, July, July, Jul…
## $ arrival_date_week_number <int> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights <int> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults <int> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <fct> BB, BB, BB, BB, BB, BB, BB, FB, BB, HB,…
## $ country <fct> PRT, PRT, GBR, GBR, GBR, GBR, PRT, PRT,…
## $ market_segment <fct> Direct, Direct, Direct, Corporate, Onli…
## $ distribution_channel <fct> Direct, Direct, Direct, Corporate, TA/T…
## $ is_repeated_guest <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <fct> C, C, A, A, A, A, C, C, A, D, E, D, D, …
## $ assigned_room_type <fct> C, C, C, A, A, A, C, C, A, D, E, D, E, …
## $ booking_changes <int> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type <fct> No Deposit, No Deposit, No Deposit, No …
## $ days_in_waiting_list <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <fct> Transient, Transient, Transient, Transi…
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <int> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status <fct> Check-Out, Check-Out, Check-Out, Check-…
## $ reservation_status_date <fct> 2015-07-01, 2015-07-01, 2015-07-02, 201…
At first sight, we see a ton of variables that need to be fixed and tweaked. So let’s start by defining a function that will take care of most of the work.
REMOVE = "remove"
TO_FACTOR = "toFactor"
TO_INTEGER = "toInteger"
TO_LOGIC = "toLogic"
TO_CHARACTER = "toCharacter"
editColumn = function(colNames, df, toDo) {
index = which(names(df) == colNames)
if (!is_empty(index)) {
if (toDo == REMOVE) {
df = df[, -index]
} else if (toDo == TO_FACTOR) {
df[, index] = as.factor(df[, index])
} else if (toDo == TO_INTEGER) {
df[, index] = as.integer(df[, index])
} else if (toDo == TO_LOGIC) {
df[, index] = as.logical(df[, index])
} else if (toDo == TO_CHARACTER) {
df[, index] = as.character(df[, index])
} else { stop("Could not perform that action.") }
}
return(df)
}
In the data set, we can clearly see that the variable we want to predict is not very descriptive, so let’s solve this.
hotel.df = editColumn(c("children"), hotel.df, TO_INTEGER)
hotel.df = editColumn(c("is_canceled"), hotel.df, TO_LOGIC)
hotel.df = editColumn(c("is_canceled"), hotel.df, TO_FACTOR)
levels(hotel.df$is_canceled) = c("no", "yes")
hotel.df = droplevels.data.frame(hotel.df)
summary(hotel.df$is_canceled)
## no yes
## 74745 44153
Now, let’s change the arrival_date_month to a number for better date representation.
levels(hotel.df$arrival_date_month)
## [1] "April" "August" "December" "February" "January" "July"
## [7] "June" "March" "May" "November" "October" "September"
months.df = data.frame(months = c("January", "February", "March", "April",
"May", "June", "July", "August", "September",
"October", "November", "December"), number = 1:12)
changeMonthToNumber = function(x) {
index = which(x == months.df$months)
if (!is_empty(index)) {
return(months.df$number[index])
}
return(0)
}
hotel.df = editColumn(c("arrival_date_month"), hotel.df, TO_CHARACTER)
hotel.df$arrival_date_month = sapply(hotel.df$arrival_date_month, changeMonthToNumber)
summary(hotel.df$arrival_date_month)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 7.000 6.553 9.000 12.000
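As a side note, base R already ships a `month.name` constant, so the same conversion could be written without the helper function. A small sketch on hypothetical values (not part of the original analysis):

```r
# match() returns each month's position in month.name, i.e. its number (1-12).
months = c("July", "December", "March")
numbers = match(months, month.name)
numbers  # 7 12 3
```

The same call could replace the whole helper: `match(as.character(hotel.df$arrival_date_month), month.name)`.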
And we will do the same thing we did to the is_canceled variable to the is_repeated_guest as they are the same type.
hotel.df = editColumn(c("is_repeated_guest"), hotel.df, TO_LOGIC)
hotel.df = editColumn(c("is_repeated_guest"), hotel.df, TO_FACTOR)
levels(hotel.df$is_repeated_guest) = c("no", "yes")
summary(hotel.df$is_repeated_guest)
## no yes
## 115092 3806
Looking at reservation_status_date, we see that it is a date, but not in a very convenient format. So let’s change it and make three columns: one for the year, one for the month, and one for the day.
hotel.df = editColumn(c("reservation_status_date"), hotel.df, TO_CHARACTER)
dim(hotel.df$reservation_status_date) = length(hotel.df$reservation_status_date)
separateDate = function(x) {
return(as.integer(unlist(strsplit(x, "-"))))
}
zero.vector = rep(0, nrow(hotel.df))
reservationDate.df = data.frame(reservation_status_date_year = zero.vector,
reservation_status_date_month = zero.vector,
reservation_status_date_day = zero.vector)
hotel.df = cbind(hotel.df, reservationDate.df)
hotel.df[, (ncol(hotel.df) - 2):(ncol(hotel.df))] = t(apply(hotel.df$reservation_status_date, 1, separateDate))
hotel.df = hotel.df[, -(ncol(hotel.df) - 3)]
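As an aside, the same split can be obtained with `as.Date()` instead of the manual `strsplit()` helper. A sketch on hypothetical dates (not part of the original analysis):

```r
# as.Date() parses ISO "YYYY-MM-DD" strings; format() then extracts each part.
dates = as.Date(c("2015-07-01", "2016-11-25"))
parts = data.frame(year  = as.integer(format(dates, "%Y")),
                   month = as.integer(format(dates, "%m")),
                   day   = as.integer(format(dates, "%d")))
parts
```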
And clean the environment.
rm(list = setdiff(ls(), "hotel.df"))
We suspect that reservation_status is strongly correlated with the is_canceled variable that we are trying to predict, so let’s check whether that is really the case.
ggplot(data = hotel.df, aes(x = reservation_status, y = is_canceled)) + geom_boxplot()
Here we can clearly see that we were right: both variables carry essentially the same information, so keeping reservation_status would leak the outcome. Therefore we will delete it, together with its date columns, as they are not meaningful anymore.
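The overlap can also be made explicit with a contingency table (a quick check, not part of the original output): every “Canceled”/“No-Show” reservation should line up with is_canceled == “yes”, and every “Check-Out” with “no”.

```r
# Cross-tabulate the two variables; off-diagonal-style mixing would
# indicate they are NOT redundant.
table(hotel.df$reservation_status, hotel.df$is_canceled)
```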
hotel.df = hotel.df[, -((ncol(hotel.df) - 3):ncol(hotel.df))]
And finally, let’s make the variable we want to predict the first column, for the sake of simplicity.
colOrder = 1:ncol(hotel.df)
colOrder[2] = 1
colOrder[1] = 2
hotel.df = hotel.df[, colOrder]
Here, we will study the relation of each variable with respect to each other and see if there are any variables worth removing or worth taking more importance.
First, let’s create the dummy variables for our data set. Before that, we will remove country, as it has too many levels.
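Before dropping it, we can confirm the factor’s cardinality (a quick check, not shown in the original output):

```r
# Each factor level would become one dummy column, so a high-cardinality
# factor like country would blow up the number of predictors.
nlevels(hotel.df$country)
```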
hotel.df = hotel.df[, -which(names(hotel.df) == "country")]
Now let’s look at reserved_room_type and assigned_room_type, because they will give us some errors later on with the models.
(which(hotel.df$reserved_room_type == "P"))
## [1] 72478 72479
(which(hotel.df$assigned_room_type == "P"))
## [1] 72478 72479
Here we see that there is a room type that only two customers have selected, so let’s delete those observations; otherwise this nearly empty level will break our models.
hotel.df = hotel.df[-which(hotel.df$reserved_room_type == "P"),]
hotel.df = droplevels.data.frame(hotel.df)
After this, let’s use the fastDummies library to create the dummy variables. But let’s keep the is_canceled column as a factor, as it is the one we want to predict.
indexFactors = names(which(sapply(hotel.df, is.factor)))
indexFactors = indexFactors[-1]
hotel.df = dummy_cols(hotel.df, select_columns = indexFactors, remove_selected_columns = TRUE)
summary(hotel.df)
## is_canceled lead_time arrival_date_year arrival_date_month
## no :74745 Min. : 0.0 Min. :2015 Min. : 1.000
## yes:44151 1st Qu.: 18.0 1st Qu.:2016 1st Qu.: 4.000
## Median : 69.0 Median :2016 Median : 7.000
## Mean :104.3 Mean :2016 Mean : 6.553
## 3rd Qu.:161.0 3rd Qu.:2017 3rd Qu.: 9.000
## Max. :737.0 Max. :2017 Max. :12.000
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## Min. : 1.00 Min. : 1.0 Min. : 0.0000
## 1st Qu.:16.00 1st Qu.: 8.0 1st Qu.: 0.0000
## Median :28.00 Median :16.0 Median : 1.0000
## Mean :27.17 Mean :15.8 Mean : 0.9289
## 3rd Qu.:38.00 3rd Qu.:23.0 3rd Qu.: 2.0000
## Max. :53.00 Max. :31.0 Max. :16.0000
## stays_in_week_nights adults children babies
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000000
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.000000
## Median : 2.000 Median : 2.000 Median : 0.0000 Median : 0.000000
## Mean : 2.502 Mean : 1.858 Mean : 0.1042 Mean : 0.007948
## 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :41.000 Max. :55.000 Max. :10.0000 Max. :10.000000
## previous_cancellations previous_bookings_not_canceled booking_changes
## Min. : 0.00000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.00000 Median : 0.0000 Median : 0.0000
## Mean : 0.08714 Mean : 0.1316 Mean : 0.2212
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000 Max. :21.0000
## days_in_waiting_list adr required_car_parking_spaces
## Min. : 0.000 Min. : -6.38 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.: 70.00 1st Qu.:0.00000
## Median : 0.000 Median : 95.00 Median :0.00000
## Mean : 2.331 Mean : 102.00 Mean :0.06189
## 3rd Qu.: 0.000 3rd Qu.: 126.00 3rd Qu.:0.00000
## Max. :391.000 Max. :5400.00 Max. :8.00000
## total_of_special_requests hotel_City Hotel hotel_Resort Hotel meal_BB
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:1.0000
## Median :0.0000 Median :1.000 Median :0.000 Median :1.0000
## Mean :0.5717 Mean :0.667 Mean :0.333 Mean :0.7726
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :5.0000 Max. :1.000 Max. :1.000 Max. :1.0000
## meal_FB meal_HB meal_SC meal_Undefined
## Min. :0.000000 Min. :0.0000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.0000 Median :0.00000 Median :0.000000
## Mean :0.006712 Mean :0.1214 Mean :0.08946 Mean :0.009798
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.0000 Max. :1.00000 Max. :1.000000
## market_segment_Aviation market_segment_Complementary market_segment_Corporate
## Min. :0.000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.00000
## Mean :0.001993 Mean :0.006174 Mean :0.04299
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000
## market_segment_Direct market_segment_Groups market_segment_Offline TA/TO
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.1047 Mean :0.1666 Mean :0.2032
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## market_segment_Online TA distribution_channel_Corporate
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.4744 Mean :0.05459
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
## distribution_channel_Direct distribution_channel_GDS
## Min. :0.0000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.0000 Median :0.000000
## Mean :0.1218 Mean :0.001623
## 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000
## distribution_channel_TA/TO distribution_channel_Undefined is_repeated_guest_no
## Min. :0.000 Min. :0.0e+00 Min. :0.000
## 1st Qu.:1.000 1st Qu.:0.0e+00 1st Qu.:1.000
## Median :1.000 Median :0.0e+00 Median :1.000
## Mean :0.822 Mean :8.4e-06 Mean :0.968
## 3rd Qu.:1.000 3rd Qu.:0.0e+00 3rd Qu.:1.000
## Max. :1.000 Max. :1.0e+00 Max. :1.000
## is_repeated_guest_yes reserved_room_type_A reserved_room_type_B
## Min. :0.00000 Min. :0.00 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00 1st Qu.:0.000000
## Median :0.00000 Median :1.00 Median :0.000000
## Mean :0.03201 Mean :0.72 Mean :0.009369
## 3rd Qu.:0.00000 3rd Qu.:1.00 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00 Max. :1.000000
## reserved_room_type_C reserved_room_type_D reserved_room_type_E
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.00783 Mean :0.1613 Mean :0.05464
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
## reserved_room_type_F reserved_room_type_G reserved_room_type_H
## Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.02431 Mean :0.01752 Mean :0.005055
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.000000
## reserved_room_type_L assigned_room_type_A assigned_room_type_B
## Min. :0.00e+00 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00e+00 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00e+00 Median :1.0000 Median :0.00000
## Mean :5.05e-05 Mean :0.6212 Mean :0.01816
## 3rd Qu.:0.00e+00 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.00e+00 Max. :1.0000 Max. :1.00000
## assigned_room_type_C assigned_room_type_D assigned_room_type_E
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.0198 Mean :0.2117 Mean :0.06508
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
## assigned_room_type_F assigned_room_type_G assigned_room_type_H
## Min. :0.00000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.000000
## Mean :0.03139 Mean :0.02135 Mean :0.005955
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.000000
## assigned_room_type_I assigned_room_type_K assigned_room_type_L
## Min. :0.000000 Min. :0.000000 Min. :0.0e+00
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0e+00
## Median :0.000000 Median :0.000000 Median :0.0e+00
## Mean :0.003003 Mean :0.002347 Mean :8.4e-06
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0e+00
## Max. :1.000000 Max. :1.000000 Max. :1.0e+00
## deposit_type_No Deposit deposit_type_Non Refund deposit_type_Refundable
## Min. :0.0000 Min. :0.0000 Min. :0.000000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.000000
## Median :1.0000 Median :0.0000 Median :0.000000
## Mean :0.8761 Mean :0.1226 Mean :0.001362
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.0000 Max. :1.000000
## customer_type_Contract customer_type_Group customer_type_Transient
## Min. :0.00000 Min. :0.000000 Min. :0.00
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.75
## Median :0.00000 Median :0.000000 Median :1.00
## Mean :0.03428 Mean :0.004794 Mean :0.75
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:1.00
## Max. :1.00000 Max. :1.000000 Max. :1.00
## customer_type_Transient-Party
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2109
## 3rd Qu.:0.0000
## Max. :1.0000
We can see that all the variables are now ready to process. Regarding the “curse of dimensionality”: it is true that we have increased the number of columns a lot, but as we have far more observations than features, we are good to go. caret could take care of the dummy creation itself, but as we want to understand the relationships in more depth, we have opted to create our own.
We remove an extra variable that is redundant (is_repeated_guest_no is just the complement of is_repeated_guest_yes).
hotel.df = hotel.df[, -which(names(hotel.df) == "is_repeated_guest_no")]
So, let’s take a look at the correlation of the variables.
df = cbind(is_canceled = as.numeric(hotel.df$is_canceled), hotel.df[, -1])
R = cor(df)
ggcorr(df, cor_matrix = R, label = TRUE)
If we zoom in a bit, we can tell that reserved_room_type is really similar to the assigned one. However, we will keep both, as our models could use the difference between them to predict better. The same happens with the market_segment and distribution_channel variables.
Also, the created dummy variables are highly correlated but that makes sense, therefore we won’t do anything to them.
Moreover, we see that arrival_date_week_number carries the same information as the month, so we will remove it.
hotel.df = hotel.df[, -which(names(hotel.df) == "arrival_date_week_number")]
rm(list = c("R", "colOrder"))
Finally, let’s take a look at a PCA to see which variables are really important, so we understand better what the models will do when they predict.
pca = prcomp(df)
fviz_contrib(pca, choice = "var", axes = 1)
As usual, the PCA without scaling is not really useful, since the components are dominated by the highest-variance variables, so let’s scale the data.
pca_scaled = prcomp(scale(df))
fviz_contrib(pca_scaled, choice = "var", axes = 1)
Here we can really see that the variables are somewhat correlated. The dates do not really matter, nor do babies or meal; however, the deposit_type and room_type variables are really important.
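To complement the contribution plot, we can also check how much variance the first scaled components actually explain (a quick check, not shown in the original output):

```r
# Rows 2 and 3 of the importance matrix are the proportion of variance
# and the cumulative proportion, here for the first five components.
summary(pca_scaled)$importance[2:3, 1:5]
```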
Now that we have somewhat an idea about our data set, let’s try to visualize it and see if we find any perceptions that we can further use in our analysis.
rm(list = setdiff(ls(), c("hotel.df", "hotel.numeric.df", "dfToNumeric")))
First things first, let’s plot the correlation of all the variables with respect to is_canceled and see if there are any really important ones.
df.numeric = cbind(is_canceled = as.numeric(hotel.df$is_canceled), hotel.df[, -1])
corr_cancelled = sort(cor(df.numeric)[1, -1], decreasing = T)
corr = data.frame(corr_cancelled)
ggplot(corr,aes(x = row.names(corr), y = corr_cancelled)) + geom_bar(stat = "identity") +
scale_x_discrete(limits= row.names(corr)) + labs(x = "", y = "is_canceled", title = "Correlations") +
theme(plot.title = element_text(hjust = 0, size = rel(1.5)), axis.text.x = element_text(angle = 45, hjust = 1))
Here, we can clearly see that hotel is one of the most important variables, which could mean that customers of one of the hotels cancel their reservations far more often. Moreover, the room type is also important, as we saw with the PCA, and the date is not that important.
Now, let’s plot each variable boxplot regarding the is_canceled one.
featurePlot(x = hotel.df[, -1],
y = hotel.df[, 1],
plot = "box",
strip=strip.custom(par.strip.text=list(cex=.7)),
scales = list(x = list(relation="free"),
y = list(relation="free")))
We first see an outlier in adr, so we will remove it. We can also see that the hotel boxplots are really similar, so we will take a further look at that. And the day and month of arrival are not that important, but the year is. So let’s remove those variables to reduce some noise.
hotel.df = hotel.df[-which(hotel.df$adr > 1000),]
hotel.df = hotel.df[, -which(names(hotel.df) == "arrival_date_month")]
hotel.df = hotel.df[, -which(names(hotel.df) == "arrival_date_day_of_month")]
Clean the environment.
rm(list = setdiff(ls(), c("hotel.df")))
The first step is to divide the data set into three parts: one for the final testing (only used for the final model chosen), one for testing all the models, and the last one for training. As the data set is huge, we will reduce the training set size for the sake of time and efficiency.
indexes.test.final = createDataPartition(hotel.df$is_canceled, p = 0.3, list = FALSE)
hotel.test.final = hotel.df[indexes.test.final, ]
hotel.remaining = hotel.df[-indexes.test.final, ]  # keep the final test set out of everything else
indexes.test = createDataPartition(hotel.remaining$is_canceled, p = 0.5, list = FALSE)
hotel.test = hotel.remaining[indexes.test, ]
hotel.train = hotel.remaining[-indexes.test, ]
Now, let’s set up caret. It will be the main package we use, so let’s define some functions that will speed up our work substantially.
Let’s separate the data set into the independent variables and the dependent one.
X.train = hotel.train[, -1]
y.train = hotel.train[, 1]
X.test = hotel.test[, -1]
y.test = hotel.test[, 1]
Because we will be fitting really complex models later on, we load precomputed binaries to speed up the process when creating the HTML file (note that all models were run and tested beforehand). Accordingly, let’s load the training and test data that were used to generate those binaries.
rm(list = setdiff(ls(), c("hotel.df")))
load("bin/data_partition.RData")
In this approach, we will use 5-fold cross-validation repeated 3 times. It is not that exhaustive, but given the size of the data set we would otherwise spend ages training the models. We will also use parallel computing to speed up the process.
Moreover, we will use Cohen’s kappa to measure the quality of the models, as it is better than plain accuracy: it measures the agreement between the predictions and the true labels beyond what would be expected by chance.
Finally, we will compute the confusion matrix inside this function, which will make processing the different models a lot easier. Note that the function has been tuned for this data set so that it works with the models used in this project.
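Since kappa drives our model selection, here is a small hand-computed sketch of how it is derived from a confusion matrix (hypothetical counts, not from the project data):

```r
# Kappa compares observed accuracy against the agreement expected if
# predictions were random but kept the same marginal rates.
cm = matrix(c(60, 10, 20, 10), nrow = 2,
            dimnames = list(pred = c("no", "yes"), ref = c("no", "yes")))
n  = sum(cm)
po = sum(diag(cm)) / n                       # observed accuracy
pe = sum(rowSums(cm) * colSums(cm)) / n^2    # chance agreement
kappa = (po - pe) / (1 - pe)
kappa  # ~0.21, much lower than the raw 0.70 accuracy
```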
runCaret = function(method, x, y, grid = NULL, x_test = NULL, y_test = NULL, family_model = NULL, preProcess = NULL) {
env <- foreach:::.foreachGlobals # remove previous parallel computing instances.
rm(list=ls(name=env), pos=env)
if (method == "svmLinearWeights2") {
trainControl = trainControl(method="repeatedcv", number = 5, repeats = 3,
savePredictions=TRUE)
} else {
trainControl = trainControl(method="repeatedcv", number = 5, repeats = 3,
savePredictions=TRUE, classProbs=TRUE)
}
cores = detectCores() - 1
cl = makeCluster(cores)
registerDoParallel(cl)
tryCatch(
if (is.null(family_model)) {
model = train(x = x,
y = y,
method = method,
trControl = trainControl,
tuneGrid = grid,
metric = "Kappa",
preProc = preProcess)
} else {
model = train(x = x,
y = y,
method = method,
trControl = trainControl,
tuneGrid = grid,
metric = "Kappa",
family = family_model,
preProc = preProcess)
})
stopCluster(cl)
pred = predict(model, newdata = x_test)
if (method == "svmLinearWeights2") {
prob = pred
} else {
prob = predict(model, newdata = x_test, type = "prob")
}
CM = confusionMatrix(pred, y_test)
print(paste(method, " used"))
print(CM)
return(list(model, prob, CM))
}
First, we will use models that are not flexible (low variance, high bias) in order to figure out how is_canceled is predicted and how it relates to the other variables. That is, we will first use statistical tools to get a feel for what is happening, and then machine learning tools to push the predictions as far as possible without worrying about what is going on inside.
In this approach, we will use the most basic tree, so we can see which variables are important for the prediction.
dt_grid = expand.grid(cp = seq(0.2, 2, 0.2))
model.dt = runCaret("rpart", X.train, y.train, grid = dt_grid, x_test = X.test, y_test = y.test)
## [1] "rpart used"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 19620 14781
## yes 35 7178
##
## Accuracy : 0.644
## 95% CI : (0.6393, 0.6486)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3128
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9982
## Specificity : 0.3269
## Pos Pred Value : 0.5703
## Neg Pred Value : 0.9951
## Prevalence : 0.4723
## Detection Rate : 0.4715
## Detection Prevalence : 0.8267
## Balanced Accuracy : 0.6626
##
## 'Positive' Class : no
##
Looking at the confusion matrix, we see that the accuracy is not that bad (0.64) for such a simple model, but the kappa is rather poor (0.31). The most concerning part, however, is the specificity: it is really low (0.33), meaning we miss most of the customers who actually cancel. So we will take a closer look at that.
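As a sanity check, the headline metrics above can be recomputed by hand from the four cells of the confusion matrix. A small base-R sketch (the cell counts are taken from the rpart output above):

```r
# Confusion matrix cells from the rpart output
# (rows = prediction, cols = reference; positive class = "no").
CM <- matrix(c(19620, 35, 14781, 7178), nrow = 2,
             dimnames = list(pred = c("no", "yes"), ref = c("no", "yes")))
n <- sum(CM)
accuracy    <- sum(diag(CM)) / n                    # correct predictions over all
sensitivity <- CM["no", "no"]   / sum(CM[, "no"])   # recall of the positive class
specificity <- CM["yes", "yes"] / sum(CM[, "yes"])  # recall of the negative class
# Cohen's kappa: agreement beyond what the marginals would produce by chance.
p_e   <- sum(rowSums(CM) * colSums(CM)) / n^2
kappa <- (accuracy - p_e) / (1 - p_e)
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa), 4)
# accuracy 0.6440, sensitivity 0.9982, specificity 0.3269, kappa 0.3128
```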
But first, let’s see the tree.
rpart.plot(model.dt[[1]]$finalModel, digits = 3)
The most important variable for the simple tree is deposit_type, which is directly related to whether the deposit is refundable. This could mean that customers who can get a refund if they cancel are more likely to do so than those who cannot. This makes a lot of sense in the real world and can be one of the most determinant factors.
With this conclusion, we could advise the hotel manager not to offer refunds in order to reduce the number of cancellations; however, this could have a strong side effect. It would probably reduce the number of cancellations, but it would also reduce the number of reservations.
So let's keep digging and see if we find more patterns.
#model.lda = runCaret("lda", X.train, y.train, x_test = X.test, y_test = y.test)
Comparing the confusion matrices, we see that LDA achieves a higher kappa (0.352), accuracy (0.6655), and specificity (0.4). Let's see why.
As LDA focuses on projecting the features in high dimension space into a lower one, it could have found a better way of separating the data than the decision tree.
#ggord(model.lda[[1]]$finalModel, y.train)
load("bin/lda.RData")
We cannot visualize the LDA in a graph: since there are only two classes, there is a single linear discriminant, so the projection would just be a line. However, we can look at the model itself and see if we reach any conclusions.
model.lda[[1]]$finalModel
## Call:
## lda(x, y)
##
## Prior probabilities of groups:
## no yes
## 0.7128531 0.2871469
##
## Group means:
## lead_time arrival_date_year stays_in_weekend_nights stays_in_week_nights
## no 82.23436 2016.245 0.8949719 2.366745
## yes 144.92263 2016.179 0.9137038 2.546483
## adults children babies previous_cancellations
## no 1.842167 0.1075331 0.009511708 0.01265202
## yes 1.897346 0.1075661 0.003469875 0.21864720
## previous_bookings_not_canceled booking_changes days_in_waiting_list
## no 0.16095480 0.28536940 1.244672
## yes 0.02739849 0.09733676 3.594340
## adr required_car_parking_spaces total_of_special_requests
## no 104.1435 0.08094028 0.7479216
## yes 104.6842 0.00000000 0.3298184
## hotel_City Hotel hotel_Resort Hotel meal_BB meal_FB meal_HB
## no 0.7420948 0.2579052 0.7627156 0.002777274 0.1112180
## yes 0.7527827 0.2472173 0.7809923 0.010274436 0.1136497
## meal_SC meal_Undefined market_segment_Aviation
## no 0.11512071 0.008168452 0.003358141
## yes 0.08845929 0.006624307 0.001351899
## market_segment_Complementary market_segment_Corporate market_segment_Direct
## no 0.008132147 0.05029951 0.13064077
## yes 0.002568609 0.02356811 0.04452255
## market_segment_Groups market_segment_Offline TA/TO market_segment_Online TA
## no 0.09054275 0.2059539 0.5110728
## yes 0.27344419 0.1896264 0.4649182
## distribution_channel_Corporate distribution_channel_Direct
## no 0.06070067 0.14592485
## yes 0.03569014 0.05889775
## distribution_channel_GDS distribution_channel_TA/TO
## no 0.0028135778 0.7905427
## yes 0.0009012663 0.9045108
## distribution_channel_Undefined is_repeated_guest_no is_repeated_guest_yes
## no 1.815211e-05 0.9631875 0.03681249
## yes 0.000000e+00 0.9862557 0.01374431
## reserved_room_type_A reserved_room_type_B reserved_room_type_C
## no 0.7103467 0.010600835 0.005863133
## yes 0.7602181 0.008336713 0.007210130
## reserved_room_type_D reserved_room_type_E reserved_room_type_F
## no 0.1785079 0.04955527 0.02652024
## yes 0.1372629 0.04402686 0.02032355
## reserved_room_type_G reserved_room_type_H reserved_room_type_L
## no 0.01515702 0.003412598 3.630423e-05
## yes 0.01658330 0.005948357 9.012663e-05
## assigned_room_type_A assigned_room_type_B assigned_room_type_C
## no 0.5821928 0.02292612 0.01849700
## yes 0.7446713 0.01203190 0.01013925
## assigned_room_type_D assigned_room_type_E assigned_room_type_F
## no 0.2452532 0.06369577 0.03459793
## yes 0.1430760 0.04501825 0.02154026
## assigned_room_type_G assigned_room_type_H assigned_room_type_I
## no 0.02080232 0.004211291 3.249229e-03
## yes 0.01698887 0.006038484 4.506331e-05
## assigned_room_type_K assigned_room_type_L deposit_type_No Deposit
## no 0.0045743329 0.000000e+00 0.9977673
## yes 0.0004055698 4.506331e-05 0.6698662
## deposit_type_Non Refund deposit_type_Refundable customer_type_Contract
## no 0.001052823 0.001179887 0.02597568
## yes 0.329007255 0.001126583 0.02775900
## customer_type_Group customer_type_Transient customer_type_Transient-Party
## no 0.005536395 0.7276275 0.2408604
## yes 0.001442026 0.8236222 0.1471768
##
## Coefficients of linear discriminants:
## LD1
## lead_time 0.0023088060
## arrival_date_year -0.0993533473
## stays_in_weekend_nights 0.0348519190
## stays_in_week_nights 0.0342858639
## adults 0.0985704560
## children 0.1884694879
## babies 0.1685903436
## previous_cancellations 0.1002348139
## previous_bookings_not_canceled -0.0120161391
## booking_changes -0.1806627172
## days_in_waiting_list 0.0006315751
## adr 0.0024306450
## required_car_parking_spaces -1.0087587345
## total_of_special_requests -0.4765361745
## hotel_City Hotel -0.2017285049
## hotel_Resort Hotel 0.2017285049
## meal_BB 0.0318668699
## meal_FB 0.0624746072
## meal_HB -0.0469001590
## meal_SC 0.0155284879
## meal_Undefined -0.3663322946
## market_segment_Aviation -0.1946696110
## market_segment_Complementary 0.5075907328
## market_segment_Corporate -0.0592958179
## market_segment_Direct -0.0755899968
## market_segment_Groups -0.0860662318
## market_segment_Offline TA/TO -0.4551610593
## market_segment_Online TA 0.3598742692
## distribution_channel_Corporate 0.0029360137
## distribution_channel_Direct -0.1431182750
## distribution_channel_GDS -0.5218723213
## distribution_channel_TA/TO 0.1121797876
## distribution_channel_Undefined -0.0206218483
## is_repeated_guest_no -0.0982915220
## is_repeated_guest_yes 0.0982915220
## reserved_room_type_A -0.3053136001
## reserved_room_type_B 0.0361748955
## reserved_room_type_C 0.4526522490
## reserved_room_type_D 0.1214506646
## reserved_room_type_E 0.4975207883
## reserved_room_type_F 0.2750360913
## reserved_room_type_G 0.5986783870
## reserved_room_type_H 0.5398921771
## reserved_room_type_L 0.3492533628
## assigned_room_type_A 0.4000799557
## assigned_room_type_B 0.0483964991
## assigned_room_type_C -0.3532595424
## assigned_room_type_D -0.1653148913
## assigned_room_type_E -0.4177191541
## assigned_room_type_F -0.5591427321
## assigned_room_type_G -0.7742871202
## assigned_room_type_H -0.4482696255
## assigned_room_type_I -0.7554780351
## assigned_room_type_K -0.1596178765
## assigned_room_type_L 3.1990086797
## deposit_type_No Deposit -1.5333419340
## deposit_type_Non Refund 1.5802524611
## deposit_type_Refundable -1.3362532413
## customer_type_Contract -0.1273727652
## customer_type_Group -0.2095492887
## customer_type_Transient 0.1215072971
## customer_type_Transient-Party -0.1084312817
If we look at the coefficients of LD1, we see that deposit_type is the variable with the highest absolute value and is therefore the most important (as the decision tree told us). Moreover, there are other notable variables such as required_car_parking_spaces, total_of_special_requests, and the market_segment dummies.
Assigned_room_type_L (positive): it looks like the clients that were assigned room L tended to cancel, so we can conclude that room type L is pretty unpopular.
Deposit_type_No Deposit and deposit_type_Refundable (negative): this could reflect that customers who have not put down any money, or who can get it back, are less likely to cancel.
Deposit_type_Non Refund (positive): if a customer is not eligible for a refund, they are, in this data set, more likely to cancel the reservation.
Required_car_parking_spaces (negative): this could mean that if a customer requires parking, they are less likely to cancel the reservation.
Total_of_special_requests (negative): this could mean that if a customer has a lot of special requests (is picky), then they probably really want this hotel and are less likely to cancel.
Market_segment_Complementary (positive): this could mean that complementary-stay customers are likely to cancel.
plot(varImp(model.lda[[1]], scale = F), scales = list(y = list(cex = .95)))
When we take a look at the variable importance of the LDA model, we can see that the conclusions are different. This is because looking only at the raw coefficients can be misleading: a large coefficient does not by itself mean the variable matters much to the model. In particular, coefficient magnitude depends on the scale of the predictor; a predictor with little variance needs a large coefficient to have any effect at all.
Here we can see that the most important variables are lead_time, deposit_type (as before), and total_of_special_requests.
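The scale-dependence of raw coefficients is easy to demonstrate on a toy example (a sketch using base R's lm; the idea carries over to LDA coefficients):

```r
set.seed(1)
# The same predictor expressed in two units: metres and kilometres.
x_m  <- runif(100, 0, 1000)   # metres
x_km <- x_m / 1000            # kilometres
y    <- 2 * x_km + rnorm(100, sd = 0.1)
b_m  <- coef(lm(y ~ x_m))["x_m"]
b_km <- coef(lm(y ~ x_km))["x_km"]
# The fit is identical, but the raw coefficient changes by a factor of 1000,
# so coefficient magnitude alone says nothing about importance.
c(b_m = unname(b_m), b_km = unname(b_km), ratio = unname(b_km / b_m))
```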
plot.roc(y.test, model.lda[[2]][,2], col="darkblue", print.auc = TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col="lightblue", print.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
After computing the ROC curve, we see that the AUC is 0.83; we will use it to compare against other models and see which one is better.
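The AUC itself has a simple rank-based (Mann-Whitney) formulation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A base-R sketch on toy scores (the vectors are made up for illustration):

```r
# AUC via the Mann-Whitney U statistic, rescaled to [0, 1].
auc <- function(scores, labels) {
  pos <- scores[labels == "yes"]
  neg <- scores[labels == "no"]
  r <- rank(c(pos, neg))  # midranks handle ties
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}
# A classifier that ranks every "yes" above every "no" has AUC 1;
# constant scores give AUC 0.5 (no better than chance).
auc(c(0.9, 0.8, 0.2, 0.1), c("yes", "yes", "no", "no"))  # 1
auc(c(0.5, 0.5, 0.5, 0.5), c("yes", "no", "yes", "no"))  # 0.5
```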
# runCaret("lda", X.train, y.train, x_test = X.test, y_test = y.test, preProcess = c("center", "scale"))
Output
[1] "lda used"
Confusion Matrix and Statistics
Reference
Prediction no yes
no 19280 13543
yes 375 8416
Accuracy : 0.6655
95% CI : (0.661, 0.6701)
No Information Rate : 0.5277
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3518
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9809
Specificity : 0.3833
Pos Pred Value : 0.5874
Neg Pred Value : 0.9573
Prevalence : 0.4723
Detection Rate : 0.4633
Detection Prevalence : 0.7887
Balanced Accuracy : 0.6821
'Positive' Class : no
If we scale and center the data, there is no noticeable improvement, and even a small decrease in kappa. This may be because we have so many dummy variables that the columns are already on comparable scales.
#model.qda = runCaret("qda", X.train, y.train, x_test = X.test, y_test = y.test)
Output
Something is wrong; all the Kappa metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :1 NA's :1
We cannot derive any meaningful information from the QDA, therefore we will skip it.
prop.table(table(y.train))
## y.train
## no yes
## 0.7128531 0.2871469
Here we see that a naïve classifier that simply predicts that no customer will cancel would reach an accuracy of only 0.71. It is true that, with the tools we have used so far, this baseline is hard to beat, but let's see if we can improve on it.
#naive_bayes_grid = expand.grid(laplace = c(0,0.5,1.0), usekernel = TRUE, adjust=c(0,0.5,1.0))
#model.nb = runCaret("naive_bayes", X.train, y.train, grid = naive_bayes_grid ,x_test = X.test, y_test = y.test)
load("bin/nb.RData")
model.nb[[3]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 19616 14818
## yes 39 7141
##
## Accuracy : 0.643
## 95% CI : (0.6384, 0.6476)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3109
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9980
## Specificity : 0.3252
## Pos Pred Value : 0.5697
## Neg Pred Value : 0.9946
## Prevalence : 0.4723
## Detection Rate : 0.4714
## Detection Prevalence : 0.8275
## Balanced Accuracy : 0.6616
##
## 'Positive' Class : no
##
Using the Naïve Bayes approach, we see that the accuracy and the kappa decrease. However, let’s see if the variables are similar to the LDA.
plot(varImp(model.nb[[1]], scale = F), scales = list(y = list(cex = .95)))
Here we see that the most important variables are lead_time, deposit_type, and total_of_special_requests. The last two are repeated from the previous models and have almost the same importance values, so our hypothesis is supported by this model as well.
We will also add to the hypothesis that the further in advance a customer books the hotel (the longer the lead time), the more likely they are, on average, to cancel the reservation; the group means above (82 days for non-cancelled versus 145 for cancelled bookings) point the same way.
Next we try a model similar to the multinomial Bayesian model provided by R.
# naive_bayes_multinomial_grid = expand.grid(estimateTheta = seq(0, 2, 0.2))
#model.nb.multinomail = runCaret("vbmpRadial", X.train, y.train, grid = naive_bayes_multinomial_grid ,x_test = X.test, y_test = y.test)
This does not work: our data set is too big for this model.
# model.lr = runCaret("glm", X.train, y.train, x_test = X.test, y_test = y.test, family_model = "binomial")
load("bin/logr.RData")
model.lr[[3]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 19334 14246
## yes 321 7713
##
## Accuracy : 0.6499
## 95% CI : (0.6453, 0.6545)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3229
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9837
## Specificity : 0.3512
## Pos Pred Value : 0.5758
## Neg Pred Value : 0.9600
## Prevalence : 0.4723
## Detection Rate : 0.4646
## Detection Prevalence : 0.8069
## Balanced Accuracy : 0.6675
##
## 'Positive' Class : no
##
After running the logistic regression, we can tell that it is performing worse than the LDA, but let’s see if we can draw some conclusions.
model.lr[[1]]$finalModel$coefficients[sort(abs(model.lr[[1]]$finalModel$coefficients), decreasing = T, index.return = T)[[2]]]
## (Intercept) assigned_room_type_C
## 1.691838e+17 -1.155327e+15
## `market_segment_Offline TA/TO` market_segment_Groups
## 3.374083e+14 7.997127e+14
## required_car_parking_spaces distribution_channel_Corporate
## -3.526467e+15 -3.599349e+15
## assigned_room_type_A reserved_room_type_B
## -6.521884e+14 5.392365e+14
## assigned_room_type_D `market_segment_Online TA`
## -1.691994e+15 NA
## reserved_room_type_C assigned_room_type_B
## 9.098544e+14 -8.723676e+14
## assigned_room_type_F reserved_room_type_L
## -2.628737e+15 NA
## reserved_room_type_H reserved_room_type_A
## 2.835641e+15 3.201468e+14
## is_repeated_guest_yes reserved_room_type_G
## NA 3.249183e+15
## is_repeated_guest_no reserved_room_type_F
## 3.517116e+14 1.933752e+15
## distribution_channel_Undefined reserved_room_type_E
## NA 1.701938e+15
## market_segment_Corporate market_segment_Aviation
## 6.948822e+14 5.473248e+14
## reserved_room_type_D assigned_room_type_E
## 1.204380e+15 -2.070179e+15
## meal_Undefined meal_SC
## NA -2.058664e+14
## `distribution_channel_TA/TO` assigned_room_type_H
## -3.524544e+15 -2.684889e+15
## market_segment_Complementary distribution_channel_Direct
## 5.662339e+14 -3.776221e+15
## previous_cancellations market_segment_Direct
## 3.381342e+14 4.937290e+14
## distribution_channel_GDS meal_HB
## -3.077365e+15 1.242088e+13
## booking_changes assigned_room_type_I
## -2.052987e+14 -4.418622e+15
## babies `hotel_City Hotel`
## -1.890071e+14 -1.311519e+14
## previous_bookings_not_canceled meal_BB
## -1.204735e+14 9.936717e+13
## total_of_special_requests `hotel_Resort Hotel`
## 1.011638e+14 NA
## assigned_room_type_G arrival_date_year
## -3.377247e+15 -8.299049e+13
## stays_in_weekend_nights children
## -2.259539e+13 -1.465835e+13
## adults meal_FB
## -1.447164e+13 1.180741e+14
## stays_in_week_nights days_in_waiting_list
## -1.151472e+13 1.079807e+12
## adr lead_time
## 9.227722e+11 4.889015e+11
Here we can see that the variables with the largest coefficients were: assigned_room_type_C, market_segment_Offline TA/TO, market_segment_Groups, required_car_parking_spaces, and distribution_channel_Corporate. We clearly see some similarities with LDA, but not many. This supports our hypothesis for market_segment_Groups, distribution_channel_Corporate, and required_car_parking_spaces, but not for the rest. Note also that the astronomical coefficient magnitudes (on the order of 1e15) are a red flag for (quasi-)complete separation or severe collinearity among the dummy variables, so these estimates should be taken with caution; the model focusing on less relevant variables could also explain its inferior kappa and accuracy.
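The coefficient blow-up above is a classic symptom of separation. A minimal base-R sketch (toy data, made up for illustration) of how perfectly separated classes inflate glm coefficients:

```r
# When a predictor splits the classes perfectly, the maximum-likelihood
# estimate of its coefficient diverges; glm stops only because it hits
# its iteration limit, leaving an absurdly large slope behind.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)  # perfectly separated at x = 3.5
fit <- suppressWarnings(glm(y ~ x, family = binomial))
coef(fit)["x"]            # huge slope, analogous to the e+15 coefficients above
```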
plot(varImp(model.lr[[1]], scale = F), scales = list(y = list(cex = .95)))
Here we see that if a customer requires a parking space, they are less likely to cancel the reservation on average (negative beta). Also, customers that have cancelled before are more likely to cancel on average.
After using all these statistical tools, we settle on the hypotheses above. If we think about it, they really make sense, and we could advise the hotels accordingly in order to reduce the number of clients that cancel. It is true that we could distinguish between both hotels and check whether the customers who cancel at one differ from those at the other, but that is beyond the goal of this project.
After taking a look at the best variables for our model, let's see if we can improve the kappa and the accuracy by using ML tools. For this we will use more flexible models with higher variance and see if the increase in variance is compensated by a decrease in bias so we can predict better. These tools can be computationally heavy, so for the slowest models the fitted binaries are loaded (or the saved output is shown); otherwise the document would take far too long to knit.
As the Logistic Regression has performed worse on average, let’s see if the Penalized Logistic Regression is better.
# model.plr = runCaret("plr", X.train, y.train, x_test = X.test, y_test = y.test)
load("bin/plr.RData")
Output
model.plr[[3]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 18792 11110
## yes 863 10849
##
## Accuracy : 0.7123
## 95% CI : (0.7079, 0.7166)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4382
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9561
## Specificity : 0.4941
## Pos Pred Value : 0.6285
## Neg Pred Value : 0.9263
## Prevalence : 0.4723
## Detection Rate : 0.4516
## Detection Prevalence : 0.7186
## Balanced Accuracy : 0.7251
##
## 'Positive' Class : no
##
According to the confusion matrix, we see a big increase in accuracy and kappa. Digging deeper, PLR has substantially increased the specificity (what we really want, given the cost of missed cancellations) with a trade-off in sensitivity (which is understandable).
model.plr[[1]]$finalModel$coefficients[sort(abs(model.plr[[1]]$finalModel$coefficients), decreasing = T, index.return = T)[[2]][1:10]]
## required_car_parking_spaces deposit_type_Non Refund
## 4.8094786 -3.5177018
## previous_cancellations deposit_type_No Deposit
## -2.8539411 2.1398114
## assigned_room_type_A deposit_type_Refundable
## -1.6046853 1.3672987
## reserved_room_type_A assigned_room_type_B
## 1.2582918 -0.9973125
## assigned_room_type_I assigned_room_type_G
## 0.9780396 0.8653565
Here the predictors with the highest coefficients are deposit_type again, required_car_parking_spaces again, and previous_cancellations. We can now be fairly sure that the first two are really important for predicting whether a client is going to cancel the reservation.
plot(varImp(model.plr[[1]], scale = F), scales = list(y = list(cex = .95)))
In the variable importance plot we see almost the same picture as for the LDA, so our previous hypothesis is reinforced: to anticipate whether a customer is going to cancel the reservation, we could look at those variables and advise the hotel to improve on them, giving a better service and reducing the cancellation ratio.
# knn_grid = expand.grid(k = seq(2, 50, 4))
# model.knn = runCaret("knn", X.train, y.train, grid = knn_grid ,x_test = X.test, y_test = y.test)
load("bin/knn.RData")
model.knn[[3]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 16667 8071
## yes 2988 13888
##
## Accuracy : 0.7342
## 95% CI : (0.73, 0.7385)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.474
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8480
## Specificity : 0.6325
## Pos Pred Value : 0.6737
## Neg Pred Value : 0.8229
## Prevalence : 0.4723
## Detection Rate : 0.4005
## Detection Prevalence : 0.5945
## Balanced Accuracy : 0.7402
##
## 'Positive' Class : no
##
Here we see that the kappa and the accuracy are better than those of the Penalized Logistic Regression, and the specificity is much better. This is the bias-variance trade-off at work: we have used a more flexible model, reducing bias at the cost of higher variance. Let's see which are the most important variables for KNN.
plot(varImp(model.knn[[1]], scale = F), scales = list(y = list(cex = .95)))
As our hypothesis said, lead_time and deposit_type are really important, followed by total_of_special_requests. Therefore, we are fairly confident that the hotel would decrease the cancellation rate if it followed our recommendations.
# svm_grid = expand.grid(cost = c(1, 2, 3), Loss = c(0.2, 0.5, 0.8), weight = c(0.5, 1, 1.5))
# model.svm = runCaret("svmLinearWeights2", X.train, y.train, grid = svm_grid ,x_test = X.test, y_test = y.test)
load("bin/svm.RData")
model.svm[[3]]
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 19631 16216
## yes 24 5743
##
## Accuracy : 0.6097
## 95% CI : (0.605, 0.6144)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2495
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9988
## Specificity : 0.2615
## Pos Pred Value : 0.5476
## Neg Pred Value : 0.9958
## Prevalence : 0.4723
## Detection Rate : 0.4717
## Detection Prevalence : 0.8614
## Balanced Accuracy : 0.6302
##
## 'Positive' Class : no
##
After computing the SVM, we see that the accuracy and the kappa are terrible, so we will not pursue this model.
# rf_grid = expand.grid(mtry = seq(1, 30, 3))
# len = as.integer(nrow(hotel.train) / 2)
# model.rf = runCaret("rf", X.train[1:len, ], y.train[1:len], grid = rf_grid ,x_test = X.test[1:len, ], y_test = y.test[1:len])
After trying multiple ways to fit a Random Forest, we have concluded that it cannot be trained on this data set with our hardware. We tried shrinking the rows by more than half, reducing the columns, and even growing fewer trees; however, on every attempt R just stopped and RStudio went blank.
# xgb_grid = expand.grid(nrounds = c(500,1000), max_depth = c(5,6,7), eta = c(0.01, 0.1, 1), gamma = c(1, 2, 3), colsample_bytree = c(1, 2), min_child_weight = c(1), subsample = c(0.2,0.5,0.8))
# model.xgb = runCaret("xgbTree", X.train, y.train, grid = xgb_grid ,x_test = X.test, y_test = y.test)
The same happened with gradient boosting: our computer simply could not handle this model on this data. It did not matter whether we reduced the observations, the predictor variables, or the hyperparameters; even after two days of processing, nothing happened. Therefore, we opted to leave it out.
Now, let's find the best probability threshold for labelling an observation as canceled or not. For this we will imagine some costs and then fit the model to minimise that cost. As we have so many observations, we will just use wide threshold steps and the ROC curve. Also, we won't do this for the complex models; otherwise the computer would spend weeks computing the results.
cost.unit <- c(0, 50, 250, 70) # per-cell costs, column-major over the confusion matrix:
                               # (pred no, obs no), (pred yes, obs no), (pred no, obs yes), (pred yes, obs yes)
Let's use it on the LDA, as it is a simple model that gave us a pretty good kappa and accuracy.
model.lda = runCaret("lda", X.train, y.train, x_test = X.test, y_test = y.test)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in lda.default(x, grouping, ...): variables are collinear
## [1] "lda used"
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 19280 13543
## yes 375 8416
##
## Accuracy : 0.6655
## 95% CI : (0.661, 0.6701)
## No Information Rate : 0.5277
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3518
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9809
## Specificity : 0.3833
## Pos Pred Value : 0.5874
## Neg Pred Value : 0.9573
## Prevalence : 0.4723
## Detection Rate : 0.4633
## Detection Prevalence : 0.7887
## Balanced Accuracy : 0.6821
##
## 'Positive' Class : no
##
plot.roc(y.test, model.lda[[2]][,2], col="darkblue", print.auc = TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col="lightblue", print.thres=TRUE)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
EconomicCost <- function(data, lev = NULL, model = NULL) {
y.pred = data$pred
y.true = data$obs
CM = confusionMatrix(y.pred, y.true)$table
out = sum(as.vector(CM)*cost.unit)/sum(CM)
names(out) <- c("EconomicCost")
out
}
This is the summary function that computes the economic cost of a model's predictions. So let's see what the cost is for the LDA.
pred.lda = as.factor(ifelse(model.lda[[2]]$yes >= 0.5, "yes", "no"))
EconomicCost(data = data.frame(pred = pred.lda, obs = y.test))
## EconomicCost
## 95.96818
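The same cost can be computed without caret, since `confusionMatrix(pred, obs)$table` has the same layout as base R's `table(pred, obs)` (prediction in rows, reference in columns, vectorised column-major). A sketch with toy vectors (made up for illustration):

```r
cost.unit <- c(0, 50, 250, 70)  # same invented per-cell costs as above
economic_cost <- function(pred, obs) {
  CM <- table(pred, obs)        # rows = prediction, cols = reference
  sum(as.vector(CM) * cost.unit) / sum(CM)
}
pred <- factor(c("no", "no", "yes", "no"), levels = c("no", "yes"))
obs  <- factor(c("no", "yes", "yes", "yes"), levels = c("no", "yes"))
# Cells (column-major): 1 true "no", 0 false "yes", 2 missed cancellations,
# 1 correctly predicted cancellation -> (0 + 0 + 2*250 + 70) / 4 = 142.5
economic_cost(pred, obs)
```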
env <- foreach:::.foreachGlobals # remove previous parallel computing instances.
rm(list=ls(name=env), pos=env)
ctrl = trainControl(method="repeatedcv", number = 5, repeats = 3,
savePredictions=TRUE, summaryFunction = EconomicCost)
model.lda.cost = train(x = X.train,y = y.train, method = "lda", trControl = ctrl, metric = "EconomicCost", maximize = F)
## Warning: model fit failed for Fold1.Rep1: parameter=none Error in lda.default(x, grouping, ...) :
## variable 33 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold5.Rep1: parameter=none Error in lda.default(x, grouping, ...) :
## variable 55 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold2.Rep2: parameter=none Error in lda.default(x, grouping, ...) :
## variable 33 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold4.Rep2: parameter=none Error in lda.default(x, grouping, ...) :
## variable 55 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold1.Rep3: parameter=none Error in lda.default(x, grouping, ...) :
## variable 55 appears to be constant within groups
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning in lda.default(x, grouping, ...): variables are collinear
## Warning: model fit failed for Fold5.Rep3: parameter=none Error in lda.default(x, grouping, ...) :
## variable 33 appears to be constant within groups
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in lda.default(x, grouping, ...): variables are collinear
model.lda.cost
## Linear Discriminant Analysis
##
## 77281 samples
## 62 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 61825, 61825, 61824, 61825, 61825, 61825, ...
## Resampling results:
##
## EconomicCost
## 52.41088
Here we see that the cost is 52, compared to the 95 obtained before. This is a huge improvement! Let's see the most important variables.
plot(varImp(model.lda.cost, scale = F), scales = list(y = list(cex = .95)))
Here we see much the same picture as for the plain LDA, so we will assume that caret has tweaked the coefficients while keeping the same relationships.
rm(list = setdiff(ls(), c("hotel.df")))
load("bin/data_partition.RData")
Now, to make our final prediction, we will use the best three models, which in our case were: Penalized Logistic Regression, KNN, and LDA. We will not use the cost-tuned LDA, because the other two are not cost-tuned and we want the ensemble members to be comparable. We could tune the other two as well, but the computer would spend weeks processing.
We are using these three models to predict on our final hold-out data set.
For this, let's import the three models.
load("bin/lda.RData")
load("bin/knn.RData")
load("bin/plr.RData")
We are going to combine the models by majority vote, i.e. by taking the mode of their predictions.
mode = function(v) {
uniqv = unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
finalPrediction = function(data) {
pred.lda = predict(model.lda[[1]], newdata = data)
pred.plr = predict(model.plr[[1]], newdata = data)
pred.knn = predict(model.knn[[1]], newdata = data)
finalPred = apply(data.frame(pred.lda, pred.plr, pred.knn), 1, mode)
return(finalPred)
}
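The majority vote can be checked on a toy set of predictions (same mode helper as above; the data frame is made up for illustration):

```r
# Mode of a vector: the most frequent value (first one wins on ties).
mode = function(v) {
  uniqv = unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Each row holds the three models' votes for one booking.
preds <- data.frame(lda = c("no", "yes"), plr = c("yes", "yes"), knn = c("no", "no"))
apply(preds, 1, mode)  # first booking -> "no" (2 of 3 votes), second -> "yes"
```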
Here we construct the function that obtains a prediction from each model and takes their mode. Now that that is done, let's use the final test data set and see how well we can predict. First, let's separate the data.
X.test.final = hotel.test.final[, -1]
y.test.final = hotel.test.final[, 1]
# pred.final = as.factor(finalPrediction(X.test.final))
load("bin/classification.pred.final.RData")
confusionMatrix(pred.final, y.test.final)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 22130 7114
## yes 294 6131
##
## Accuracy : 0.7923
## 95% CI : (0.7881, 0.7965)
## No Information Rate : 0.6287
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5028
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9869
## Specificity : 0.4629
## Pos Pred Value : 0.7567
## Neg Pred Value : 0.9542
## Prevalence : 0.6287
## Detection Rate : 0.6204
## Detection Prevalence : 0.8199
## Balanced Accuracy : 0.7249
##
## 'Positive' Class : no
##
Looking at the confusion matrix, the ensemble reaches an accuracy of almost 0.8 and a kappa of 0.5. That is amazing! It is almost a 15% increase over the best single model that we had (note, though, that this final test set has a different class balance: its no-information rate is 0.63).
We see that we can predict the customers that won't cancel pretty accurately, but not so much the customers that will. For that, we could apply the cost approach used previously to this final model; however, it is out of the scope of this project.
In conclusion, we can predict whether a customer will cancel with approximately 80% accuracy. This is pretty high and will be crucial for the hotels.
For this second task we will use the variable “adr” (average daily rate) and try to predict it for a given customer. Let's first tweak our data set a little.
rm(list = setdiff(ls(), c("hotel.df")))
adrCol = which(names(hotel.df) == "adr")
colOrder = 1:ncol(hotel.df)
colOrder[adrCol] = 1
colOrder[1] = adrCol
hotel.df = hotel.df[, colOrder]
We made adr the first variable for the sake of simplicity.
hotel.df = hotel.df[, -which(names(hotel.df) == "is_canceled")]
We remove this variable, as we are no longer trying to predict the cancellation.
And now, let’s divide the data set.
indexes.test.final = createDataPartition(hotel.df$adr, p = 0.3, list = FALSE)
hotel.test.final = hotel.df[indexes.test.final, ]
hotel.rest = hotel.df[-indexes.test.final, ]  # rows not in the final test set
# Partition the remaining rows (not the full data frame), so that the
# final test set cannot leak into training.
indexes.test = createDataPartition(hotel.rest$adr, p = 0.5, list = FALSE)
hotel.test = hotel.rest[indexes.test, ]
hotel.train = hotel.rest[-indexes.test, ]
X.train = hotel.train[, -1]
y.train = hotel.train[, 1]
X.test = hotel.test[, -1]
y.test = hotel.test[, 1]
Now, we have our data set ready to perform regression.
rm(list = setdiff(ls(), c("hotel.df")))
load("bin/data_partition_regression.RData")
This is a checkpoint with the data loaded.
runCaret = function(method, form, data, test, grid = NULL, family_model = NULL, preProcess = NULL) {
env <- foreach:::.foreachGlobals # remove previous parallel computing instances.
rm(list=ls(name=env), pos=env)
trainControl = trainControl(method="repeatedcv", number = 5, repeats = 3, allowParallel = T)
cores = detectCores() - 1
cl = makeCluster(cores)
registerDoParallel(cl)
tryCatch(
if (is.null(family_model)) {
model = train(form,
data,
method = method,
trControl = trainControl,
tuneGrid = grid,
preProc = preProcess)
} else {
model = train(form,
data,
method = method,
trControl = trainControl,
tuneGrid = grid,
family = family_model,
preProc = preProcess)
})
stopCluster(cl)
pred = predict(model, test[, -1])
cor.pred = cor(test[, 1], pred)^2
print(paste(method, " used"))
print(model)
print(paste("R^2 of the testing is ", cor.pred))
return(list(model, pred, cor.pred))
}
We have slightly tweaked the function to better accommodate regression.
Let’s start by visualizing the data with respect to the adr variable.
ggcorr(hotel.df, label = TRUE)
If we look just at the adr correlations, we do not see any meaningful correlation with the other variables. However, the assigned_room variables have really high correlations among themselves. This is expected, as they come from the same factor variable and are just dummies. Therefore, we will have to take care with collinearity in our models.
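Since most columns are dummies derived from the same factors, some are exact linear combinations of others. As a sketch (assuming the dummy columns are already numeric), caret can flag them directly:

```r
# findLinearCombos() detects exact linear dependencies in a numeric matrix
# and suggests which columns to drop to break them.
num.cols <- names(hotel.df)[sapply(hotel.df, is.numeric)]
combos <- caret::findLinearCombos(as.matrix(hotel.df[, num.cols]))
num.cols[combos$remove]  # candidate columns to remove
```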
corr_price <- sort(cor(hotel.train)["adr",], decreasing = T)
## Warning in cor(hotel.train): the standard deviation is zero
corr=data.frame(corr_price)
ggplot(corr,aes(x = row.names(corr), y = corr_price)) +
geom_bar(stat = "identity", fill = "lightblue") +
scale_x_discrete(limits= row.names(corr)) +
labs(x = "Predictors", y = "Price", title = "Correlations") +
theme(plot.title = element_text(hjust = 0, size = rel(1.5)),
axis.text.x = element_text(angle = 45, hjust = 1))
Here we see a really interesting graph. The number of children and adults is strongly positively correlated with adr, as is the type of room. Moreover, some room types such as A are negatively correlated (presumably A is the cheapest type of room).
rm(list = c("corr_price", "corr"))
Now let’s see the distribution of the adr.
hotel.train %>% ggplot(aes(x=adr)) + geom_density(fill = "black")
The distribution is not smooth; it looks like a normal distribution with a median of approximately 100, a stretched right tail, and something odd at the beginning. So what about the log?
hotel.train %>% ggplot(aes(x=adr)) + geom_density(fill="black") + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1233 rows containing non-finite values (stat_density).
Here we do see some improvement, and we have removed the odd left part. However, as there is no noticeable improvement overall, let’s use adr without any transformation.
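One way to keep the adr = 0 rows while still compressing the right tail would be a log1p transform, i.e. log(1 + x), which is defined at zero; a sketch:

```r
# log1p avoids the -Inf values that log10 produces for adr == 0,
# so no rows are dropped from the density plot.
hotel.train %>%
  ggplot(aes(x = adr)) +
  geom_density(fill = "black") +
  scale_x_continuous(trans = "log1p")
```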
ggplot(hotel.train, aes(x = (adults + children + babies), y=adr)) + ylab("adr") +
geom_point(alpha=0.6) + ggtitle("Price per person")
Here we see some really interesting patterns: the price slightly increases with the number of people at the beginning (with some outliers), but there are some points far away with adr = 0. Let’s investigate a little.
weirdIndexes = which(hotel.train$adr == 0 & (hotel.train$adults + hotel.train$children + hotel.train$babies) >= 15)
head(hotel.train[weirdIndexes, ])  # View() opens an interactive viewer and does not render in knitted HTML
These look like a promotion that the hotel ran, so we will remove them; otherwise they will introduce a lot of noise into our model.
hotel.train = hotel.train[-weirdIndexes, ]
Now let’s clean the environment and perform some simple regressions to try to understand better the data.
rm(list = ls())
load("bin/data_partition_regression_ready.RData")
Collinearity: as stated before, we will have to take real care as the majority of our columns come from the creation of dummy variables.
Overfitting: because of the large number of columns, our models could overfit. However, the ratio of rows to columns is really high, so this problem is somewhat mitigated. Overfitting will mainly be a concern when we try to interpret the betas, not when predicting.
Let’s try fitting a simple multiple regression and observe the heavy overfitting.
model.mr = runCaret("lm", adr ~ ., hotel.train, test = hotel.test)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## [1] "lm used"
## Linear Regression
##
## 118888 samples
## 60 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 95111, 95110, 95111, 95110, 95110, 95111, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 36.13239 0.4354296 25.97487
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is 0.410757105804983"
Our R² on the test set using this model is 0.41. This is pretty bad, and our first guess is that it is due to the huge amount of overfitting. However, let’s see if we can draw some meaningful conclusions.
plot(varImp(model.mr[[1]], scale = F), scales = list(y = list(cex = .95)))
Here we see that the most important variable is the one indicating whether the hotel is the City Hotel. This could be because the City Hotel is much more expensive than the other one. This is a reasonable hypothesis; let’s see if it is true.
mean(hotel.train$adr[which(hotel.train$`hotel_City Hotel` == 1)])
## [1] 105.2624
mean(hotel.train$adr[which(hotel.train$`hotel_City Hotel` == 0)])
## [1] 95.33249
Here we see that, on average, the City Hotel is about 10 units more expensive than the other hotel. So this predictor will be very handy for predicting the price per night.
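To check that this roughly 10-unit gap is not just noise, we could run a Welch two-sample t-test; a sketch:

```r
# Compare mean adr between the two hotels; a tiny p-value would confirm
# the difference in means is statistically significant.
city   <- hotel.train$adr[hotel.train$`hotel_City Hotel` == 1]
resort <- hotel.train$adr[hotel.train$`hotel_City Hotel` == 0]
t.test(city, resort)
```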
model.mr[[1]]$finalModel
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Coefficients:
## (Intercept) lead_time
## -1.856e+04 -2.968e-02
## arrival_date_year stays_in_weekend_nights
## 9.223e+00 -5.020e-01
## stays_in_week_nights adults
## 7.309e-01 8.939e+00
## children babies
## 1.635e+01 2.405e+00
## previous_cancellations previous_bookings_not_canceled
## -1.033e+00 -3.069e-02
## booking_changes days_in_waiting_list
## 5.373e-01 7.820e-03
## required_car_parking_spaces total_of_special_requests
## 7.073e+00 2.630e+00
## `\\`hotel_City Hotel\\`` `\\`hotel_Resort Hotel\\``
## 2.712e+01 NA
## meal_BB meal_FB
## -3.003e+01 1.372e+01
## meal_HB meal_SC
## 7.488e-01 -4.365e+01
## meal_Undefined market_segment_Aviation
## NA -1.347e+01
## market_segment_Complementary market_segment_Corporate
## -1.081e+02 -2.496e+01
## market_segment_Direct market_segment_Groups
## -5.579e-01 -3.099e+01
## `\\`market_segment_Offline TA/TO\\`` `\\`market_segment_Online TA\\``
## -2.460e+01 NA
## distribution_channel_Corporate distribution_channel_Direct
## 2.069e+01 1.633e+01
## distribution_channel_GDS `\\`distribution_channel_TA/TO\\``
## 3.800e+01 1.756e+01
## distribution_channel_Undefined is_repeated_guest_yes
## NA -1.615e+01
## reserved_room_type_A reserved_room_type_B
## -7.105e+01 -9.469e+01
## reserved_room_type_C reserved_room_type_D
## -2.744e+01 -4.631e+01
## reserved_room_type_E reserved_room_type_F
## -4.227e+01 -3.047e+01
## reserved_room_type_G reserved_room_type_H
## -1.903e+01 2.528e-01
## reserved_room_type_L assigned_room_type_A
## NA 1.341e+02
## assigned_room_type_B assigned_room_type_C
## 1.377e+02 1.366e+02
## assigned_room_type_D assigned_room_type_E
## 1.269e+02 1.351e+02
## assigned_room_type_F assigned_room_type_G
## 1.404e+02 1.436e+02
## assigned_room_type_H assigned_room_type_I
## 1.373e+02 8.811e+01
## assigned_room_type_K assigned_room_type_L
## 8.840e+01 NA
## `\\`deposit_type_No Deposit\\`` `\\`deposit_type_Non Refund\\``
## -1.357e+01 -3.137e-01
## deposit_type_Refundable customer_type_Contract
## NA -1.883e+00
## customer_type_Group customer_type_Transient
## -1.241e+01 -1.280e+00
## `\\`customer_type_Transient-Party\\``
## NA
qplot(model.mr[[2]], hotel.test[, 1]) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 300), y = c(0, 300)) +
geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 313 rows containing missing values (geom_point).
And here we see the huge overfitting that we have. It is tremendous. What if we scale the data?
model.mr.scale = runCaret("lm", adr ~ ., hotel.train, test = hotel.test, preProcess = c("scale", "center"))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## [1] "lm used"
## Linear Regression
##
## 118888 samples
## 60 predictor
##
## Pre-processing: scaled (60), centered (60)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 95110, 95109, 95111, 95112, 95110, 95110, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 36.12724 0.4355737 25.97483
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is 0.410757105804979"
There is no improvement.
qplot(model.mr.scale[[2]], hotel.test[, 1]) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 300), y = c(0, 300)) +
  geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 313 rows containing missing values (geom_point).
And the overfitting is the same.
Now let’s try to fit some other models with much less overfitting.
model.hmr = runCaret("lm", adr ~ lead_time + I(adults + children + babies) + `hotel_City Hotel` +
I(stays_in_weekend_nights + stays_in_week_nights),
hotel.train, test = hotel.test)
## [1] "lm used"
## Linear Regression
##
## 118888 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 95112, 95111, 95109, 95109, 95111, 95110, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 43.6508 0.1794635 32.04992
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is 0.172311291668375"
Here we see that the test R² is a terrible 0.17, so we won’t use this model. However, let’s look at the relationships.
plot(varImp(model.hmr[[1]], scale = F), scales = list(y = list(cex = .95)))
Here we see that the number of people is the most important variable. However, due to the low R² we would not trust this model at all.
hotel.df$people = hotel.df$adults + hotel.df$children + hotel.df$babies
We just combined the three variables to see if the total number of people is a good predictor for our models.
hotel.df = hotel.df[, -which(names(hotel.df) == "hotel_Resort Hotel")]
We remove the Resort Hotel dummy, as it is perfectly collinear with the City Hotel dummy (it is simply its complement).
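Alternatively, fastDummies can drop one level per factor at creation time, which avoids this perfect collinearity for every factor at once. A sketch, where hotel.raw is a hypothetical name for the data frame before dummy creation:

```r
# remove_first_dummy drops one reference level per factor, so no dummy is
# an exact linear combination of its siblings; remove_selected_columns
# drops the original factor columns after encoding.
hotel.df <- fastDummies::dummy_cols(hotel.raw,
                                    remove_first_dummy = TRUE,
                                    remove_selected_columns = TRUE)
```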
rm(list = ls())
load("bin/data_partition_regression_ready_feature.RData")
Now we load the precomputed model to speed up the process.
# fwdr_grid = expand.grid(nvmax = 4:(ncol(hotel.train)-10))
# model.fwdr = runCaret("leapForward", adr ~ ., hotel.train, test = hotel.test, grid = fwdr_grid)
load("bin/fwdr.RData")
The best model was achieved with 51 variables and a test R² of 0.41, the same as with all the variables. Therefore this method is better than the first one, since we achieve the same R² with fewer variables. An R² of 0.41 means that 41% of the variance in the test set is explained by the model; the remaining 59% is just noise. This is not very good for prediction!
So let’s see the variables that are important for this model.
plot(varImp(model.fwdr[[1]], scale = F), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Here the most important variables are reserved_room_type_A, the number of children, and whether the market segment is Online TA. Now let’s look at their coefficients to see whether the relationships are positive or negative and draw some conclusions.
model.fwdr[[1]]$bestTune
## nvmax
## 48 51
We see that the best tune is at 51 variables.
sort(coef(model.fwdr[[1]]$finalModel, 51), decreasing = T)
## NULL
Here we see that reserved_room_type_A has a coefficient of about -40; that is, on average, a customer pays 40 units less when booking a room of type A compared to the baseline.
Also, the number of children has a coefficient of 16 (positive), while adults has 8; that is, on average, each child adds about 8 units more to the price than each adult.
These are really interesting conclusions, as they match what we would expect. And have we decreased the overfitting?
qplot(model.fwdr[[2]], hotel.test[, 1]) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 300), y = c(0, 300)) +
geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 319 rows containing missing values (geom_point).
Slightly, but not a lot. The observations where adr is zero are probably promotions, as stated before; to fix this we would need to ask the owner of the data set to include another variable (e.g. promotion) indicating whether a booking was promotional, in order to predict more accurately. Since this data set was not designed to predict this variable, that is why we get a pretty low R².
Now let’s use backward regression, although we expect to get the same results as forward regression.
bwdr_grid = expand.grid(nvmax = 4:(ncol(hotel.train)-10))
model.bwdr = runCaret("leapBackward", adr ~ ., hotel.train, test = hotel.test, grid = bwdr_grid)
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 9
## linear dependencies found
## Reordering variables and trying again:
## Warning in rval$lopt[] <- rval$vorder[rval$lopt]: number of items to replace is
## not a multiple of replacement length
## [1] "leapBackward used"
## Linear Regression with Backwards Selection
##
## 77281 samples
## 60 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 61824, 61825, 61826, 61825, 61824, 61825, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE
## 4 42.63806 0.1635902 31.62946
## 5 42.07633 0.1854437 31.15672
## 6 41.95973 0.1899737 31.10753
## 7 41.80578 0.1959081 31.00519
## 8 41.12272 0.2219793 30.34479
## 9 40.21661 0.2567459 29.37830
## 10 39.46677 0.2841285 28.80734
## 11 39.19366 0.2939538 28.59910
## 12 39.11342 0.2968612 28.53906
## 13 39.03658 0.2996077 28.46156
## 14 38.31308 0.3250146 27.57253
## 15 38.24822 0.3273214 27.50214
## 16 38.13892 0.3310252 27.41621
## 17 37.58948 0.3500763 26.97205
## 18 37.56298 0.3510054 26.93853
## 19 37.44373 0.3551247 26.87642
## 20 37.16043 0.3648716 26.66441
## 21 37.08833 0.3673183 26.63213
## 22 36.17901 0.3978193 25.96245
## 23 36.10486 0.4003049 25.92161
## 24 36.08165 0.4010751 25.89831
## 25 36.04677 0.4022400 25.86691
## 26 36.03143 0.4027520 25.85509
## 27 36.00319 0.4036906 25.82611
## 28 35.97749 0.4045526 25.81342
## 29 35.94445 0.4056359 25.80400
## 30 35.93692 0.4058929 25.79347
## 31 35.92205 0.4063858 25.78218
## 32 35.90724 0.4068751 25.77032
## 33 35.83519 0.4092052 25.71811
## 34 35.70060 0.4136036 25.63097
## 35 35.38170 0.4240599 25.44348
## 36 35.32244 0.4260179 25.40520
## 37 35.20815 0.4297349 25.33339
## 38 35.17713 0.4307274 25.30880
## 39 35.08306 0.4337858 25.22767
## 40 35.05179 0.4348018 25.21287
## 41 35.02067 0.4358040 25.18366
## 42 34.99394 0.4366723 25.16937
## 43 34.95816 0.4378258 25.14809
## 44 34.94799 0.4381537 25.13851
## 45 34.94187 0.4383477 25.13046
## 46 34.93148 0.4386820 25.12765
## 47 34.88139 0.4403082 25.06850
## 48 34.81646 0.4423915 24.99902
## 49 34.80408 0.4427919 24.98899
## 50 34.79205 0.4431723 24.98484
## 51 34.78041 0.4435399 24.97422
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was nvmax = 51.
## [1] "R^2 of the testing is 0.399817616910282"
We get the same results as with forward regression, so there is no meaningful new information to extract here. However, let’s see if the most important variables are the same.
plot(varImp(model.bwdr[[1]], scale = F), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
And they are exactly the same, as expected. For this reason we won’t use stepwise regression, as it would give the exact same results. The main takeaway, apart from a few individual conclusions, is that this data set is not well suited to predicting the variable adr, and that is why we need as many variables as possible.
Let’s create a model with fewer variables and try to get a similar R², using roughly the top predictors, to try to reduce the overfitting.
custom.model = adr ~ reserved_room_type_A + children + `market_segment_Online TA` + adults +
assigned_room_type_A + market_segment_Groups + assigned_room_type_G + customer_type_Transient + assigned_room_type_F +
total_of_special_requests + reserved_room_type_D
model.custom.lr = runCaret("lm", custom.model,
hotel.train, test = hotel.test)
## [1] "lm used"
## Linear Regression
##
## 77281 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 61826, 61826, 61824, 61824, 61824, 61826, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 39.39202 0.2866397 29.1128
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is 0.26355102272907"
It is pretty terrible at predicting, so let’s try some modifications.
custom.model = adr ~ reserved_room_type_A + children + `market_segment_Online TA` + adults +
assigned_room_type_A + market_segment_Groups + assigned_room_type_G + customer_type_Transient + assigned_room_type_F +
total_of_special_requests + reserved_room_type_D + market_segment_Complementary + distribution_channel_Corporate
model.custom.lr = runCaret("lm", custom.model,
hotel.train, test = hotel.test)
## [1] "lm used"
## Linear Regression
##
## 77281 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 61826, 61824, 61824, 61825, 61825, 61824, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 38.54242 0.3166468 28.64607
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] "R^2 of the testing is 0.286403596991798"
We are getting the same pattern as with forward regression: the more variables we add, the higher the R². Therefore, for this data set, let’s use all the variables to predict adr. From a statistical point of view this is a pretty bad approach, as we are adding a lot of noise to the model (overfitting). However, for prediction it is the best we can do here. The main reason is that this data set was prepared to classify whether customers would cancel their reservation, not to predict the average price per night, which depends on many factors not captured in the data.
What about ridge regression. Is the penalization of higher dimension better so we do not end up with 51 variables in our model? Let’s see.
#ridge_grid = expand.grid(lambda = seq(0, .1, length = 100))
#model.ridge = runCaret("ridge", adr ~ ., hotel.train, test = hotel.test, grid = ridge_grid)
Something is wrong; all the RMSE metric values are missing:
RMSE Rsquared MAE
Min. : NA Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA Median : NA
Mean :NaN Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA Max. : NA
NA's :100 NA's :100 NA's :100
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
It is a shame, but after trying ridge regression in multiple ways, it did not work. The main problem, I think, is the huge number of dummy variables in this data set: if a cross-validation fold does not contain every level, the fit fails with this error. I tried removing columns and shrinking the data, both row-wise and column-wise, but nothing worked.
What about not creating the dummy variables ourselves and letting caret take care of it?
#rm(list = ls())
#load("bin/data_partition_regression_ready_feature_no_dummy.RData")
#ridge_grid = expand.grid(lambda = seq(0, .1, length = 10))
#model.ridge = runCaret("ridge", adr ~ ., hotel.train, test = hotel.test, grid = ridge_grid)
And nothing; it keeps breaking, so let’s skip it. It is a shame, because ridge regression was really promising for simplifying our best model, as it shrinks the coefficients of the less useful predictors. However, due to the enormous data set we are not able to run it.
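A possible workaround, not attempted here, would be to skip caret’s "ridge" wrapper (which relies on the elasticnet package) and call glmnet directly with alpha = 0, since glmnet tends to handle sparse dummy designs more robustly; a sketch under that assumption:

```r
# Ridge regression via glmnet: alpha = 0 gives a pure L2 penalty and
# cv.glmnet picks lambda by cross-validation.
x.train <- as.matrix(hotel.train[, -1])   # predictors (adr is column 1)
y.train <- hotel.train[, 1]
cv.ridge <- glmnet::cv.glmnet(x.train, y.train, alpha = 0, nfolds = 5)
pred <- predict(cv.ridge, newx = as.matrix(hotel.test[, -1]), s = "lambda.min")
cor(hotel.test[, 1], pred)^2              # test R^2
```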
#lasso_grid = expand.grid(fraction = seq(0, .1, length = 100))
#model.lasso = runCaret("lasso", adr ~ ., hotel.train[1:10000, ], test = hotel.test, grid = lasso_grid)
Lasso is pretty similar to ridge regression (it uses the L1 norm of the betas instead of the squared L2 norm for the penalty), and it does not work either. So let’s see if elastic net works for us.
#elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))
#model.elastic = runCaret("glmnet", adr ~ ., hotel.train, test = hotel.test, grid = elastic_grid)
load("bin/elastic.RData")
After computing the model we see that the R² is just 0.4, the same as our first model.
plot(model.elastic[[1]])
model.elastic[[1]]$bestTune
## alpha lambda
## 176 0.15 0.1
And this makes our hypothesis stronger. The RMSE is essentially flat across all the mixing percentages (alpha) and regularization strengths (lambda), so regularization barely helps: the model needs as many variables as possible to improve its R². That is, this data set is not really well suited to predicting the adr variable, but let’s see if we can draw some conclusions out of this model.
plot(varImp(model.elastic[[1]], scale = F), scales = list(y = list(cex = .95)))
For this model the most important variables are the room types that a customer reserved or was assigned. This is really interesting, as we saw earlier that type A rooms were the cheapest ones; now let’s look at the coefficients.
coef(model.elastic[[1]]$finalModel)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 's0', 's1', 's2' ... ]]
##
## (Intercept) 104.3174 103.7061 103.1868088 102.710528
## lead_time . . . .
## arrival_date_year . . . .
## stays_in_weekend_nights . . . .
## stays_in_week_nights . . . .
## adults . . . .
## children . . 0.1590313 1.073439
## babies . . . .
## previous_cancellations . . . .
## previous_bookings_not_canceled . . . .
##
## (Intercept) 101.997133 101.189070 100.401587 99.637521
## lead_time . . . .
## arrival_date_year . . . .
## stays_in_weekend_nights . . . .
## stays_in_week_nights . . . .
## adults . . . .
## children 1.957537 2.815247 3.653877 4.469494
## babies . . . .
## previous_cancellations . . . .
## previous_bookings_not_canceled . . . .
##
## (Intercept) -349.2796720 -1351.551356
## lead_time . .
## arrival_date_year 0.2222897 0.719040
## stays_in_weekend_nights . .
## stays_in_week_nights . .
## adults . .
## children 5.2541012 5.913572
## babies . .
## previous_cancellations . .
## previous_bookings_not_canceled . .
This does not give us much meaningful information because of the huge number of predictors, so let’s use our simpler model.
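A more readable view would be to extract the coefficients at the tuned lambda only and rank them by magnitude; a sketch:

```r
# Coefficients of the final glmnet fit at the lambda chosen by caret,
# ranked by absolute size (intercept dropped).
betas <- coef(model.elastic[[1]]$finalModel,
              s = model.elastic[[1]]$bestTune$lambda)
head(sort(abs(betas[-1, 1]), decreasing = TRUE), 10)
```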
qplot(model.elastic[[2]], hotel.test[, 1]) +
labs(title="Linear Regression Observed VS Predicted", x="Predicted", y="Observed") +
lims(x = c(0, 300), y = c(0, 300)) +
geom_abline(intercept = 0, slope = 1, colour = "blue")
## Warning: Removed 316 rows containing missing values (geom_point).
Wow! Here we see some improvement in the overfitting. It is true that the model still tends to under-predict; that is, on average it under-prices. However, there is less overfitting.
elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))
model.elastic.simple = runCaret("glmnet", custom.model, hotel.train, test = hotel.test, grid = elastic_grid)
## [1] "glmnet used"
## glmnet
##
## 77281 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 61825, 61825, 61824, 61826, 61824, 61824, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.00 0.00 38.58005 0.3155642 28.64729
## 0.00 0.01 38.58005 0.3155642 28.64729
## 0.00 0.02 38.58005 0.3155642 28.64729
## 0.00 0.03 38.58005 0.3155642 28.64729
## 0.00 0.04 38.58005 0.3155642 28.64729
## 0.00 0.05 38.58005 0.3155642 28.64729
## 0.00 0.06 38.58005 0.3155642 28.64729
## 0.00 0.07 38.58005 0.3155642 28.64729
## 0.00 0.08 38.58005 0.3155642 28.64729
## 0.00 0.09 38.58005 0.3155642 28.64729
## 0.00 0.10 38.58005 0.3155642 28.64729
## 0.01 0.00 38.56527 0.3159801 28.64774
## 0.01 0.01 38.56527 0.3159801 28.64774
## 0.01 0.02 38.56527 0.3159801 28.64774
## 0.01 0.03 38.56527 0.3159801 28.64774
## 0.01 0.04 38.56527 0.3159801 28.64774
## 0.01 0.05 38.56527 0.3159801 28.64774
## 0.01 0.06 38.56527 0.3159801 28.64774
## 0.01 0.07 38.56527 0.3159801 28.64774
## 0.01 0.08 38.56527 0.3159801 28.64774
## 0.01 0.09 38.56527 0.3159801 28.64774
## 0.01 0.10 38.56527 0.3159801 28.64774
## 0.02 0.00 38.56530 0.3159783 28.64772
## 0.02 0.01 38.56530 0.3159783 28.64772
## 0.02 0.02 38.56530 0.3159783 28.64772
## 0.02 0.03 38.56530 0.3159783 28.64772
## 0.02 0.04 38.56530 0.3159783 28.64772
## 0.02 0.05 38.56530 0.3159783 28.64772
## 0.02 0.06 38.56530 0.3159783 28.64772
## 0.02 0.07 38.56530 0.3159783 28.64772
## 0.02 0.08 38.56530 0.3159783 28.64772
## 0.02 0.09 38.56530 0.3159783 28.64772
## 0.02 0.10 38.56530 0.3159783 28.64772
## 0.03 0.00 38.56532 0.3159777 28.64769
## 0.03 0.01 38.56532 0.3159777 28.64769
## 0.03 0.02 38.56532 0.3159777 28.64769
## 0.03 0.03 38.56532 0.3159777 28.64769
## 0.03 0.04 38.56532 0.3159777 28.64769
## 0.03 0.05 38.56532 0.3159777 28.64769
## 0.03 0.06 38.56532 0.3159777 28.64769
## 0.03 0.07 38.56532 0.3159777 28.64769
## 0.03 0.08 38.56532 0.3159777 28.64769
## 0.03 0.09 38.56532 0.3159777 28.64769
## 0.03 0.10 38.56532 0.3159777 28.64769
## 0.04 0.00 38.56535 0.3159786 28.64763
## 0.04 0.01 38.56535 0.3159786 28.64763
## 0.04 0.02 38.56535 0.3159786 28.64763
## 0.04 0.03 38.56535 0.3159786 28.64763
## 0.04 0.04 38.56535 0.3159786 28.64763
## 0.04 0.05 38.56535 0.3159786 28.64763
## 0.04 0.06 38.56535 0.3159786 28.64763
## 0.04 0.07 38.56535 0.3159786 28.64763
## 0.04 0.08 38.56535 0.3159786 28.64763
## 0.04 0.09 38.56535 0.3159786 28.64763
## 0.04 0.10 38.56535 0.3159786 28.64763
## 0.05 0.00 38.56536 0.3159782 28.64762
## 0.05 0.01 38.56536 0.3159782 28.64762
## 0.05 0.02 38.56536 0.3159782 28.64762
## 0.05 0.03 38.56536 0.3159782 28.64762
## 0.05 0.04 38.56536 0.3159782 28.64762
## 0.05 0.05 38.56536 0.3159782 28.64762
## 0.05 0.06 38.56536 0.3159782 28.64762
## 0.05 0.07 38.56536 0.3159782 28.64762
## 0.05 0.08 38.56536 0.3159782 28.64762
## 0.05 0.09 38.56536 0.3159782 28.64762
## 0.05 0.10 38.56536 0.3159782 28.64762
## 0.06 0.00 38.56540 0.3159766 28.64755
## 0.06 0.01 38.56540 0.3159766 28.64755
## 0.06 0.02 38.56540 0.3159766 28.64755
## 0.06 0.03 38.56540 0.3159766 28.64755
## 0.06 0.04 38.56540 0.3159766 28.64755
## 0.06 0.05 38.56540 0.3159766 28.64755
## 0.06 0.06 38.56540 0.3159766 28.64755
## 0.06 0.07 38.56540 0.3159766 28.64755
## 0.06 0.08 38.56540 0.3159766 28.64755
## 0.06 0.09 38.56540 0.3159766 28.64755
## 0.06 0.10 38.56540 0.3159766 28.64755
## 0.07 0.00 38.56538 0.3159773 28.64757
## 0.07 0.01 38.56538 0.3159773 28.64757
## 0.07 0.02 38.56538 0.3159773 28.64757
## 0.07 0.03 38.56538 0.3159773 28.64757
## 0.07 0.04 38.56538 0.3159773 28.64757
## 0.07 0.05 38.56538 0.3159773 28.64757
## 0.07 0.06 38.56538 0.3159773 28.64757
## 0.07 0.07 38.56538 0.3159773 28.64757
## 0.07 0.08 38.56538 0.3159773 28.64757
## 0.07 0.09 38.56538 0.3159773 28.64757
## 0.07 0.10 38.56538 0.3159773 28.64757
## 0.08 0.00 38.56538 0.3159749 28.64753
## 0.08 0.01 38.56538 0.3159749 28.64753
## 0.08 0.02 38.56538 0.3159749 28.64753
## 0.08 0.03 38.56538 0.3159749 28.64753
## 0.08 0.04 38.56538 0.3159749 28.64753
## 0.08 0.05 38.56538 0.3159749 28.64753
## 0.08 0.06 38.56538 0.3159749 28.64753
## 0.08 0.07 38.56538 0.3159749 28.64753
## 0.08 0.08 38.56538 0.3159749 28.64753
## 0.08 0.09 38.56538 0.3159749 28.64753
## 0.08 0.10 38.56538 0.3159749 28.64753
## 0.09 0.00 38.56538 0.3159745 28.64754
## 0.09 0.01 38.56538 0.3159745 28.64754
## 0.09 0.02 38.56538 0.3159745 28.64754
## 0.09 0.03 38.56538 0.3159745 28.64754
## 0.09 0.04 38.56538 0.3159745 28.64754
## 0.09 0.05 38.56538 0.3159745 28.64754
## 0.09 0.06 38.56538 0.3159745 28.64754
## 0.09 0.07 38.56538 0.3159745 28.64754
## 0.09 0.08 38.56538 0.3159745 28.64754
## 0.09 0.09 38.56538 0.3159745 28.64754
## 0.09 0.10 38.56538 0.3159745 28.64754
## 0.10 0.00 38.56540 0.3159763 28.64751
## 0.10 0.01 38.56540 0.3159763 28.64751
## 0.10 0.02 38.56540 0.3159763 28.64751
## 0.10 0.03 38.56540 0.3159763 28.64751
## 0.10 0.04 38.56540 0.3159763 28.64751
## 0.10 0.05 38.56540 0.3159763 28.64751
## 0.10 0.06 38.56540 0.3159763 28.64751
## 0.10 0.07 38.56540 0.3159763 28.64751
## 0.10 0.08 38.56540 0.3159763 28.64751
## 0.10 0.09 38.56540 0.3159763 28.64751
## 0.10 0.10 38.56540 0.3159763 28.64751
## 0.11 0.00 38.56544 0.3159751 28.64741
## 0.11 0.01 38.56544 0.3159751 28.64741
## 0.11 0.02 38.56544 0.3159751 28.64741
## 0.11 0.03 38.56544 0.3159751 28.64741
## 0.11 0.04 38.56544 0.3159751 28.64741
## 0.11 0.05 38.56544 0.3159751 28.64741
## 0.11 0.06 38.56544 0.3159751 28.64741
## 0.11 0.07 38.56544 0.3159751 28.64741
## 0.11 0.08 38.56544 0.3159751 28.64741
## 0.11 0.09 38.56544 0.3159751 28.64741
## 0.11 0.10 38.56544 0.3159751 28.64741
## 0.12 0.00 38.56538 0.3159737 28.64763
## 0.12 0.01 38.56538 0.3159737 28.64763
## 0.12 0.02 38.56538 0.3159737 28.64763
## 0.12 0.03 38.56538 0.3159737 28.64763
## 0.12 0.04 38.56538 0.3159737 28.64763
## 0.12 0.05 38.56538 0.3159737 28.64763
## 0.13 0.00 38.56540 0.3159757 28.64749
## 0.14 0.00 38.56541 0.3159781 28.64758
## 0.15 0.00 38.56544 0.3159739 28.64754
## 0.16 0.00 38.56545 0.3159769 28.64753
## 0.17 0.00 38.56541 0.3159730 28.64773
## 0.18 0.00 38.56543 0.3159717 28.64745
## 0.19 0.00 38.56542 0.3159769 28.64758
## 0.20 0.00 38.56541 0.3159727 28.64773
## (rows for the remaining lambda values 0.01-0.10 omitted: within each alpha the metrics are identical)
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.01 and lambda = 0.1.
## [1] "R^2 of the testing is 0.286151472151666"
Oh, this is really terrible: this model is 25% worse than the previous one. Still, let’s see whether we can draw any conclusions.
plot(varImp(model.elastic.simple[[1]], scale = F), scales = list(y = list(cex = .95)))
Here we see that, again, elastic net focuses mostly on the room type and the market segment. So let’s take a look at the coefficients.
coef(model.elastic.simple[[1]]$finalModel)[1:10, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 's0', 's1', 's2' ... ]]
##
## (Intercept) 104.3174 104.38957956 104.45922206 104.5195607
## reserved_room_type_A . -0.09977738 -0.20808248 -0.3259203
## children . . 0.08323533 0.2008870
## `market_segment_Online TA` . . . 0.0252787
## adults . . . .
## assigned_room_type_A . . . .
## market_segment_Groups . . . .
## assigned_room_type_G . . . .
## customer_type_Transient . . . .
## assigned_room_type_F . . . .
##
## (Intercept) 104.5538204 104.61556715 104.57760742 104.53220829
## reserved_room_type_A -0.4533059 -0.59063064 -0.73702093 -0.89511598
## children 0.3282879 0.46646857 0.61583106 0.77768863
## `market_segment_Online TA` 0.1147508 0.21183976 0.31645884 0.42979645
## adults . . 0.07044053 0.14734262
## assigned_room_type_A . -0.04015922 -0.12753553 -0.22130944
## market_segment_Groups . . . -0.01195244
## assigned_room_type_G . . . 0.09595685
## customer_type_Transient . . . .
## assigned_room_type_F . . . 0.06913068
##
## (Intercept) 104.4827388 104.33217943
## reserved_room_type_A -1.0619217 -1.23758418
## children 0.9474354 1.13085906
## `market_segment_Online TA` 0.5500779 0.67609141
## adults 0.2299825 0.31786653
## assigned_room_type_A -0.3185852 -0.41980711
## market_segment_Groups -0.1272397 -0.24698813
## assigned_room_type_G 0.4000957 0.72689892
## customer_type_Transient . 0.07322347
## assigned_room_type_F 0.3096616 0.56764520
Here we see the following:
Market_segment_Complementary (strongly negative): on average, this segment’s nightly price is about 25 price units below the overall average. This may be because this segment books room type A (cheaper, as we saw) or because it has some special deal with the hotel.
Reserved_room_type_A (negative): on average this room type is about 5 price units below the average, as we knew.
Children (positive): as we saw, reservations with children are more expensive than those with only adults.
So the conclusions we reached to understand the price of a specific customer are:
Room type A is the cheapest.
Reservations with children tend to pay more on average.
The Complementary market segment either books mostly type A rooms or has some special deal with the hotel; that is why it is the segment with the lowest average adr.
Now, will the graphs say the same as our conclusions?
ggplot(hotel.train) + aes(x = as.factor(assigned_room_type_A), y = adr) + geom_boxplot()
Awesome! Our first conclusion was correct: on average, type A rooms are cheaper than the others.
ggplot(hotel.train) + aes(x = as.factor(children), y = adr) + geom_boxplot()
And also correct: the more children in a reservation, the higher the price.
ggplot(hotel.train) + aes(x = as.factor(market_segment_Complementary), y = adr) + geom_boxplot()
And wow! That is a really big difference. We could study further why this segment has a lower adr on average, but that is out of scope. Now let’s try to predict the adr variable better (right now we are doing a little bit poorly).
NOTE: statistical tools suggest that these variables are the most important for predicting adr. However, as shown later, if we fit non-linear (more flexible) models, the number of people and the room type A come out as the most important variables. Linear models do not show this clearly because the relationship is not linear, so it is not noticeable with linear statistical tools. We could try adding second-order (or higher) polynomial terms to check whether that is really the case. But let’s now try to predict.
For KNN we will standardize the data, because otherwise variables with large scales (such as the year) would dominate the distance over variables with small scales. It is always a good idea to standardize the data for KNN.
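The effect of caret’s `preProcess = c("center", "scale")` can be sketched in base R (toy numbers, just to show why scaling matters for distances):

```r
# Toy frame: "year" lives on a scale thousands of times larger than "children",
# so without scaling it would dominate any Euclidean distance.
x <- data.frame(year = c(2015, 2016, 2017), children = c(0, 1, 2))

# Manual equivalent of preProcess(c("center", "scale")): subtract the mean,
# divide by the standard deviation, column by column.
x.std <- as.data.frame(scale(x))

round(colMeans(x.std), 10)  # every column now has mean 0
apply(x.std, 2, sd)         # ... and standard deviation 1
```

After standardization both variables contribute comparably to the distance, so no single column can dominate the neighbor search.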
# knn.reg_grid = data.frame(kmax=c(11,13,15,19,21),distance=2, kernel='optimal')
# model.knn.reg = runCaret("kknn", adr ~ ., hotel.train, test = hotel.test, grid = knn.reg_grid, preProcess = c("center", "scale"))
load("bin/kknn.RData")
model.knn.reg[[3]]
## [1] 0.5288361
It is impressive that knn gives an R² of 0.53; a huge leap from where we were.
model.knn.reg[[1]]
## k-Nearest Neighbors
##
## 77281 samples
## 60 predictor
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 61825, 61825, 61824, 61825, 61825, 61825, ...
## Resampling results across tuning parameters:
##
## kmax RMSE Rsquared MAE
## 11 31.45667 0.5479302 19.61445
## 13 31.29816 0.5512924 19.66235
## 15 31.20029 0.5533680 19.72326
## 19 31.10693 0.5552679 19.86855
## 21 31.10041 0.5553931 19.88979
##
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
## parameter 'kernel' was held constant at a value of optimal
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 21, distance = 2 and kernel
## = optimal.
Here we see that the more neighbors we choose, the better; this matches our earlier observation that the more columns we kept, the better. It further supports our hypothesis that this data set is not well suited to predict adr, which is why it needs so many neighbors. We can also infer that, because the data set is so large, many neighbors are needed to predict accurately. This k was the maximum our computer could handle (larger values kept crashing), but this is the best model so far.
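The bias–variance intuition behind “more neighbors is better here” can be illustrated with a minimal 1-D k-NN regressor in base R (a toy sketch on synthetic data; the actual model above uses caret’s `kknn`):

```r
# Minimal k-NN regression: predict at x0 as the mean response of the
# k training points closest to x0.
knn.reg.simple <- function(train.x, train.y, new.x, k) {
  sapply(new.x, function(x0) {
    d <- abs(train.x - x0)        # distances to every training point
    mean(train.y[order(d)[1:k]])  # average the k nearest responses
  })
}

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)  # noisy non-linear signal
grid <- seq(0, 10, 0.5)

# k = 1 chases the noise; k = 21 averages it out, giving a smoother,
# lower-variance fit -- the same reason the large, noisy hotel data
# favours a large k.
pred.k1  <- knn.reg.simple(x, y, grid, k = 1)
pred.k21 <- knn.reg.simple(x, y, grid, k = 21)
```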
plot(model.knn.reg[[1]])
This graph confirms what we said above.
plot(varImp(model.knn.reg[[1]], scale = T), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
These are some really impressive results. The variable that we created looks like the most important one, so we can add to our hypothesis that the number of people in a reservation is highly correlated with the average price. As stated before, room type A and the number of children are also important.
#rf_grid = data.frame(mtry = seq(1, 15, 3))
#model.rf = runCaret("rf", adr ~ ., hotel.train, test = hotel.test, grid = rf_grid)
After multiple attempts to make it work, we have concluded that this data set is too complex for our computer to handle a random forest, so we will leave it for now.
# gradient_grid = expand.grid(nrounds = c(500,1000), max_depth = c(5,6,7), eta = c(0.01, 0.1, 1),
#                             gamma = c(1, 2, 3), colsample_bytree = c(0.5, 1),
#                             min_child_weight = c(1), subsample = c(0.2,0.5,0.8))
#model.xgb = runCaret("xgbTree", adr ~ ., hotel.train, test = hotel.test, grid = gradient_grid)
Same as the Random Forest.
# svm_grid = expand.grid(cost = c(1, 2, 3))
# model.svm = runCaret("svmLinear2", adr ~ ., hotel.train, grid = svm_grid , test = hotel.test)
load("bin/svm_reg.RData")
model.svm[[1]]
## Support Vector Machines with Linear Kernel
##
## 77281 samples
## 60 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 61825, 61825, 61825, 61825, 61824, 61825, ...
## Resampling results across tuning parameters:
##
## cost RMSE Rsquared MAE
## 1 35.40016 0.4341135 24.26441
## 2 35.36536 0.4343760 24.32515
## 3 35.24342 0.4363186 24.26424
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cost = 3.
model.svm[[3]]
## [1] 0.3919574
Here we can see that the R² is just 0.39, which is a little disappointing. However, let’s see the most important variables.
plot(varImp(model.svm[[1]], scale = T), scales = list(y = list(cex = .95)))
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
And this is impressive! The most important variables are the same as for knn, so we can add these variables to our hypothesis.
It is true that most of the models we tried are not very good. The main reason is that this data set was not designed to predict the average price per night of a hotel; still, we managed to get pretty good results with knn (as far as we could). We might have obtained better results with models such as random forest or gradient boosting, but we had computational constraints. We tried to reduce the dimensionality of the data set and also the number of rows, but we were unsuccessful. Now let’s try to combine the best three models and see if we can get better results.
Now, to make our final prediction, we will use the three best models, which in our case were KNN, Elastic Net, and Forward Regression.
For this, let’s import the three models.
rm(list = ls())
load("bin/data_partition_regression_ready_feature.RData")
load("bin/fwdr.RData")
load("bin/kknn.RData")
load("bin/elastic.RData")# to create
We are going to use the mean of the three models’ predictions.
finalPrediction = function(data) {
pred.fwdr = predict(model.fwdr[[1]], newdata = data)
pred.knn = predict(model.knn.reg[[1]], newdata = data)
pred.elastic = predict(model.elastic[[1]], newdata = data)
finalPred = apply(data.frame(pred.fwdr, pred.knn, pred.elastic), 1, mean)
return(finalPred)
}
Here we construct the function that collects each model’s prediction and averages them. So, now that that’s done, let’s use the final test data set and see how well we can predict the data.
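A simple average weights the three models equally. A variant we did not try here would weight each model by the inverse of its cross-validated RMSE, so the stronger model counts more; a sketch with hypothetical RMSE numbers (illustrative only, not the real CV results):

```r
# Hypothetical CV RMSEs for the three ensemble members (illustrative only).
rmse <- c(fwdr = 38.6, knn = 31.1, elastic = 38.6)

# Inverse-RMSE weights, normalised to sum to 1: knn, the best model,
# receives the largest weight.
w <- (1 / rmse) / sum(1 / rmse)

# Weighted ensemble prediction for one observation whose per-model
# predictions are 100, 110 and 105.
preds <- c(fwdr = 100, knn = 110, elastic = 105)
sum(w * preds)
```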
pred.final = finalPrediction(hotel.test.final[, -1])
cor(hotel.test.final[, 1], pred.final)^2
## [1] 0.5539924
And as we saw with classification, the ensemble improves the prediction: the R² rises from knn’s 0.53 to about 0.55. Now, let’s see our errors.
ggplot(as.data.frame(pred.final)) + aes(x = pred.final) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It looks pretty much like a normal distribution slightly shifted to the left. Some predictions are negative; we could round those up to zero, and the most likely explanation for them is the observations where adr was 0.
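Rounding the negative predictions up to zero would be a one-liner with `pmax()` (toy vector standing in for `pred.final`):

```r
# pmax() replaces every negative prediction with 0, leaving the rest intact.
pred.example <- c(-3.2, 0, 45.7, 120.1)
pmax(pred.example, 0)  # 0.0 0.0 45.7 120.1
```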
So let’s try to compute some prediction intervals, so we can tell a customer whether the price they are given is under-priced, well priced, or over-priced.
But first, what about the errors?
error = hotel.test.final[, 1] - pred.final
ggplot(as.data.frame(error)) + aes(x = error) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is pretty good: the error of our models is centered at 0 and follows more or less a normal distribution, which is good news for the prediction intervals. Since the right tail is heavier, the lower limit will be tight while the upper limit will be wider.
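Because the intervals are built from empirical quantiles of the residuals, they need no normality assumption at all. A quick sanity check of the idea on fully simulated data (synthetic numbers, not the hotel set) shows that the split-and-quantile recipe recovers the nominal coverage:

```r
set.seed(42)
truth <- rnorm(10000, mean = 100, sd = 20)   # simulated "real" adr values
pred  <- truth + rnorm(10000, sd = 10)       # predictions with centred noise

# First half estimates the error quantiles, second half is scored on them.
err <- (truth - pred)[1:5000]
lwr <- pred[5001:10000] + quantile(err, 0.10)
upr <- pred[5001:10000] + quantile(err, 0.90)

# Empirical coverage should land close to the nominal 80%.
mean(truth[5001:10000] >= lwr & truth[5001:10000] <= upr)
```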
To build the prediction intervals, we will use half of the data to estimate the error quantiles (otherwise we would overfit and the intervals would not be that accurate), and the other half to evaluate the lower and upper limits.
noise = error[1:(length(error) / 2)]
Because the right tail of the error distribution is pretty large, we will compute an 80% prediction interval.
pred.interval.data = pred.final[((length(pred.final) / 2) + 1):length(pred.final)]
lwr = pred.interval.data - abs(quantile(noise,0.1, na.rm=T))
upr = pred.interval.data + quantile(noise,0.9, na.rm=T)
pred.interval.df = data.frame(real= hotel.test.final[((nrow(hotel.test.final) / 2) + 1):nrow(hotel.test.final), 1],
fit= pred.interval.data, lwr=lwr, upr=upr)
pred.interval.df = pred.interval.df %>% mutate(out=factor(if_else((real<lwr | real>upr),1,0)))
mean(pred.interval.df$out==1)
## [1] 0.06117185
So here about 6% of customers were over-priced or under-priced according to our model.
ggplot(pred.interval.df, aes(x=fit, y=real))+
geom_point(aes(color=out)) + theme(legend.position="none") +
xlim(0, 300) + ylim(0, 300)+
geom_ribbon(data=pred.interval.df,aes(ymin=lwr,ymax=upr),alpha=0.3) +
labs(title = "Prediction intervals", x = "prediction",y="real price")
## Warning: Removed 83 rows containing missing values (geom_point).
Here we see that our ensemble model tends to under-price the real adr, but it is pretty good overall. With this we can tell each customer whether they are paying more or less than they should. Finally, we can see some zero values on the real-price axis; those are most likely promotions that the hotel ran, or errors where reservations were not charged. As said before, we would need to add another variable to capture those.
After predicting the qualitative variable and the quantitative one, we have clearly seen that this data set is well suited for classification and not for regression. However, all in all, we achieved some pretty good results.
Furthermore, we got some pretty good insights from the data, and we could advise the hotels on how to improve their cancellation ratio and tell customers how much a night should cost.